ABSTRACT

This evaluative and developmental study was
undertaken between 1972 and 1974 to determine the effectiveness of items
used for the Test of English as a Foreign Language (TOEFL) in
relationship to other item types used in assessing English
proficiency, and to recommend possible changes in TOEFL content and
format. TOEFL was developed to assess the English proficiency of
non-native English-speaking students applying to institutions of
higher education in the United States. Questions of validation,
criterion selection and content specification were first investigated
before nine written and oral TOEFL item formats were evaluated for
possible use in a revised test. Both original and new formats were
administered to 98 Peruvian, 145 Chilean and 199 Japanese subjects in
their native countries. Open-ended response measures and
multiple-choice measures were examined. Intercorrelations among test scores
indicated that the test could be revised to incorporate three instead
of five components: (1) listening comprehension; (2) English
structure and writing ability; (3) reading comprehension and
vocabulary in context. Four objective subtests aimed at increasing
TOEFL effectiveness, and tailored criterion measures of English
productive skills, speaking and writing, were also developed. (ABF)
FOREWORD

The Test of English as a Foreign Language (TOEFL) is well known
among university officials and others who are concerned with the
admission of foreign students who are nonnative speakers of English to
institutions of higher education in the United States and Canada. It
is certainly well known among the thousands of foreign students who
take it each year, in many foreign countries, as one of the
requirements for entrance to American colleges and universities. Some of
them may think of it as a "devil" of a test, for the acronym is all
too close to the German word for devil!

All these people, the foreign students and the university
officials, have a right to expect that the TOEFL is the fairest,
most accurate, and most valid test of its type that can be devised.

Constructing such a test is not easy. Questions have frequently
been raised as to whether the TOEFL is in fact as good as it might be.
Because the test has remained in essentially the same form over many
years, some people may have arrived at the impression that Educational
Testing Service (the organization responsible for the test) is
resistant to making changes in it.

The present monograph will give the lie to such an impression.
It reports an extensive research study that was designed to explore
possible changes in the format and content of the TOEFL. It
illustrates several points: that it is extremely difficult and expensive
to conduct a really thoroughgoing study of possible changes; that
many apparently reasonable suggestions for change turn out not to be
so valid and feasible after all; and that nevertheless, some changes
prove to be promising, desirable, and feasible. Most of all, however,
it indicates, at least to me, that the essential philosophy and
direction of the TOEFL as it now exists, or as it might be modified
in certain ways suggested here, is sound and credible.

Teachers of English (and perhaps teachers of other languages),
as well as language testing specialists, will find much of interest
and value in this monograph. I welcome Dr. Pike's monograph as a
substantial contribution to the field.



John B. Carroll 
University of North Carolina 
at Chapel Hill 
PURPOSE AND BACKGROUND

A major purpose of the Test of English as a Foreign Language (TOEFL)
is to provide information useful to colleges and universities in making
decisions regarding the admission, placement, and possible assignment to
special language instruction of foreign students planning to study in the
United States and Canada. Its unique role is to assess the foreign
student's competence in English that he will need in order to successfully
pursue a program of studies at a college or university where English is
the medium of instruction.

The overall purpose of the present study was to obtain information
useful for evaluating and revising TOEFL content and content
specifications. To achieve this purpose, questions of validation, criterion
selection, and content specifications were investigated.

The tasks involved in carrying out this study were developmental as
well as evaluative. Thus, four objective subtests that might increase
the effectiveness of TOEFL, and tailored criterion measures of English
productive skills, speaking and writing, were developed. Also included
in the study were two open-ended response measures which have shown
particular promise for testing English as a foreign language. One is a
rewriting task used by Kellogg Hunt (1970a, 1970b), and the other is the
"Cloze procedure" task (Taylor, 1953), in which subjects are instructed
to replace words that have been deleted from prose passages.

Validation, criterion selection, content specifications, and related
questions regarding TOEFL will be considered first. The subtests used in
carrying out the study and the data resulting from the study will be
presented in later sections of this paper.

Questions of Validity and Criterion Selection

In an evaluation of a test or its components for possible revision,
consideration must be given to questions of validity, reliability, and
practicality. In keeping with the basic purpose of this study, all three
were considered, with special emphasis on the question of validity. Tied
to the question of validation is that of criterion selection or
development. In the present study, criterion selection is, in turn, based on
the purpose and content of the TOEFL examination, and on the goal of the
validation study itself.

English Language Skills as Criteria 

Although TOEFL is used as an admissions test, it, unlike other
admissions tests such as the Scholastic Aptitude Test and the Graduate
Record Examination, is not intended to serve as a predictor of academic
success, particularly as this is measured by grade-point average.
Rather, TOEFL is used to ascertain if foreign students have sufficient
command of English to enable them to study at institutions where English
is the medium of instruction without being handicapped by inadequate
communication skills. The first implication of the above considerations
is that it is more appropriate to use measures of various areas of
competence in English as criteria rather than some more inclusive index
such as grade-point average in establishing the validity of TOEFL. A
second implication of viewing the essential task of TOEFL as the
assessment of current English language skills, rather than as the prediction
of future academic success, is that the validation procedure becomes one
of concurrent instead of predictive validity.

At the time this study was conducted, the TOEFL examination was made
up of the following five sections, each directed to one of the areas of
competence in English as a second language: I. Listening Comprehension;
II. English Structure; III. Vocabulary; IV. Reading Comprehension; and
V. Writing Ability. The need for such a differentiated test of
second-language performance was described by Carroll (1968):

The problem of different areas of competence becomes much
more acute in dealing with a second or foreign language, where
the experiences of learners are likely not to be as homogeneous
as those of native language learners, and where various
well-known difficulties interpose themselves in learning--the
interference of the native language, the slow progress due to
the student's lack of time or motivation for study, etc.
Furthermore, since learning a second language often makes much
more use of written material than does learning a native
language (where reading is rarely started until the spoken
language is fairly well mastered), competence in spoken and
written aspects may develop somewhat independently, and these
competences must be separately assessed. It is also much more
important to observe the distinction between productive and
receptive skills because progress in these two aspects may not
proceed pari passu as it ordinarily does in the native language.
It is quite possible for a competence to relate specifically to
production and not to reception, or vice versa [p. 52].

Another partitioning of skills, cutting across the competencies of
listening and reading and of speaking and writing, is that between vocabulary
and structure. Of these six areas of English language competence, the
TOEFL examination provides direct assessment of the receptive abilities,
listening and reading, and of vocabulary and English structure, as well
as indirect assessment of writing competence. Because of feasibility
constraints, English speaking skills are not measured. Thus the TOEFL
examination provided five scores considered potentially relevant for
diagnostic as well as admissions interpretation.

The differentiated nature of the TOEFL examination and the purpose
of the present validation study further define the question of criterion
selection. This study is not designed to compare TOEFL with other
examinations, nor is it limited to obtaining an index of how well it
measures overall performance in English as a second language. Rather, it is
intended to indicate how well several component language competencies
(including the speaking of English) are estimated by the TOEFL subtests
and by the alternative experimental measures developed for this study.
Thus, instead of one overall measure of English competence, a set of
criterion measures directed to specific component skills is called for.

The above discussion provides a rationale whereby the present study 
must involve concurrent validation, directed to six areas of English 
language competence. Before discussing criterion selection for these six 
areas, it will be helpful to review previous validation studies involving 
TOEFL, and to make certain theoretical observations about the problem of 
validation. 

Previous TOEFL Validation Studies 

A summary of predictive and concurrent validity studies involving
TOEFL is provided in the booklet Test of English as a Foreign Language:
Interpretive Information (1970). The predictive validity studies, with
their emphasis on the grade-point average criterion, provide little
information bearing on the evaluation and revision of TOEFL content
specifications. More to the point are the concurrent validity studies
summarized in the booklet, although most of these are focused on
comparing TOEFL scores with scores on similar kinds of tests, such as the
American Language Institute Test of Proficiency in English and the
Michigan Test of English Language Proficiency. Among the concurrent
validity studies summarized in the booklet, a study carried out by the
staff of the American Language Institute at Georgetown University in
cooperation with the staff of Educational Testing Service (Pitcher and
Ra, 1967) is the most relevant to the questions posed in this paper.
In that study, correlations were obtained between each TOEFL subtest and
a criterion of judged essay-writing performance. The correlations with
the criterion were generally in the order one would logically expect,
with Writing Ability and English Structure correlating highest (.74 and
.74) and Listening Comprehension lowest (.56).
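
The kind of concurrent validity coefficient reported above can be illustrated with a minimal sketch. It assumes the coefficients are ordinary Pearson product-moment correlations between subtest scores and a criterion score; the function name and all scores below are hypothetical and invented for illustration, not data from Pitcher and Ra (1967) or from the present study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable has zero variance")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six candidates (invented for illustration only).
essay_criterion = [2, 3, 3, 4, 5, 5]
subtests = {
    "Writing Ability": [31, 38, 36, 44, 52, 55],
    "Listening Comprehension": [40, 35, 47, 42, 50, 44],
}

# Correlate every subtest with the same criterion, then list highest first.
validity = {name: pearson_r(scores, essay_criterion) for name, scores in subtests.items()}
for name, r in sorted(validity.items(), key=lambda item: -item[1]):
    print(f"{name}: r = {r:.2f}")
```

Ranking the subtests by their correlation with a single criterion is what allows the ordering of coefficients to be compared with the ordering one would logically expect.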
Criterion Selection 

In most validation studies, the problem of criterion selection and 
development cannot be fully resolved. Even when ultimate criteria can 
be agreed upon and clearly stated, feasibility constraints typically 
require compromises of such nature that the criterion measures adopted 
are only relatively more direct than those being validated. Nevertheless, 
the reduced constraints of cost and time in an experimental study allow 
the use of criteria that can measure the target abilities more directly 
than those being validated, and that may have, as well, greater face
validity. Furthermore, the circularity of using tests to validate tests
may be partly offset by a consideration of construct validity. For
example, the Pitcher and Ra finding that Writing Ability correlated higher
with the essay-writing criterion than did most of the other TOEFL sections
lends credence to its construct validity as a writing measure beyond that
implied by the size of the correlation itself. 

The receptive language skills of listening and reading appear to be
quite directly measured by the Listening Comprehension and Reading
Comprehension sections of TOEFL. Although more direct measures of these skills
which could serve as criteria can be readily conceived (using videotaped
classroom sessions to test listening comprehension, for example), they
were not considered feasible within the constraints and scope of the
present study. Therefore, no criterion measures were developed for listening
and reading. For the productive language skills of speaking and writing,
however, criterion measures were developed that called for actual
performance in speaking and writing. In the spoken mode, English structure and
vocabulary criteria were provided through judgments of the tape-recorded
interviews used to assess speaking ability, along specified dimensions.

The Hunt rewriting task and the Cloze procedure tasks illustrate the 
relative nature of the question, "What is a criterion?" In the present 
study, each may be considered a quasi-criterion that was used to validate 
less direct, multiple-choice measures but was itself validated against 
even more direct measures of English language competence. 

Questions of Content Specifications 

The development of alternative TOEFL measures and the evaluation of 
these and of present TOEFL subtests were guided by questions concerning 
the content specifications. Some of these questions have been raised by 
the TOEFL Committee of Examiners and ETS staff members at various times
since the inception of TOEFL; others derive from a close examination of 
TOEFL and from reviewing the relevant literature of language testing. 

For convenience, questions regarding the present TOEFL subtests will 
be presented first, followed by those related to the open-ended Hunt and 
Cloze procedure tasks and the alternative multiple-choice measures. Directions 
and sample items for each of these measures are provided in the Plan and 
Procedure section of this report. 
Present TOEFL Sections

In general, there is an interest in the reliability, efficiency, and
concurrent validity of each of the TOEFL sections, and in the degree of
construct validity indicated by relationships within the full set of
estimator and criterion variables. For the TOEFL sections having two or
three parts, each with a different item format, the question of the
relative merits of each part is also of interest. In addition to these
general questions, questions and observations that emerged regarding the
individual TOEFL sections are given below.

T1. Listening Comprehension. Although the background information
and questions in this section are presented aurally, the answer choices
are given in written form only. The influence of the reading component
on the resulting listening scores is, therefore, of interest.

T2. English Structure. The items in this section seem to stress
standard English usage as it is spoken in the United States more than
differences in meaning that are conveyed by structure. Generally, however,
this section appears well accepted because the importance of testing
English structure is often emphasized in discussions of testing English
as a foreign language (Bilyeu, 1969; Carroll, 1968; Fisher and Masia, 1965).

T3. Vocabulary. There has been a general concern that vocabulary
may be given too much emphasis as compared with English structure. At
various times, one or more TOEFL Committee or ETS staff members have
suggested dropping the Vocabulary section altogether. One criticism
is that a vocabulary section may encourage an overemphasis on vocabulary
scores; this probably contributes little that is unique to what is
measured by the other TOEFL sections. Yet another criticism is that,
too often, the vocabulary items include low-frequency, esoteric words
that are of little practical use to the foreign student.

T4. Reading Comprehension. This section has high face validity,
but it requires substantially more testing time (and test development
costs) to reach a given level of reliability than do most other sections
of TOEFL. The general question, of course, is whether a more efficient
measure can be developed to replace all or part of this section, with its
format of several reading passages, each followed by a number of questions.

T5. Writing Ability. The value of this section has been questioned
by the TOEFL Committee and ETS staff, some of whom have suggested
dropping it unless validity data clearly indicate that it should be retained.
The use of a writing sample has been suggested as a replacement for the
less direct Writing Ability section. Thus, the question of how well
Writing Ability scores estimate essay-writing criterion scores is of
particular interest.

Hunt and Cloze Procedure Tasks 

Hunt rewriting task. The principal measure obtained from Kellogg
Hunt's rewriting task, "Words per T-Unit," is essentially a refinement
of the familiar sentence-length measure. The Words per T-Unit measure
has worked very well for estimating the English language "syntactic maturity"
of American children and adults (Hunt, 1970b), and it has intriguing
possibilities for doing the same with respect to foreign students' command
of the sentence-embedding aspect of English structure (Hunt, 1970a).
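
As a rough illustration of the Words per T-Unit measure, the sketch below computes the mean number of words per T-unit for one writing sample. It assumes that a scorer has already divided the sample into T-units by hand (a T-unit being one main clause together with whatever subordinate clauses and nonclausal structures are attached to it); the function name and the two sample rewrites are hypothetical and are not taken from Hunt's materials.

```python
def words_per_t_unit(t_units):
    """Mean number of words per T-unit for one writing sample.

    `t_units` is a list of strings, each holding one T-unit.
    Assumption: segmentation into T-units was done by a human scorer;
    this function only counts whitespace-separated words.
    """
    if not t_units:
        raise ValueError("at least one T-unit is required")
    word_counts = [len(unit.split()) for unit in t_units]
    return sum(word_counts) / len(word_counts)


# Two hypothetical rewrites of the same content (invented for illustration).
short_units = [
    "The metal is light",
    "It is found in clay",
    "It is used for cans",
]
embedded_units = [
    "The light metal, which is found in clay, is used for cans",
]

print(words_per_t_unit(short_units))
print(words_per_t_unit(embedded_units))
```

A writer who embeds the three short statements in a single sentence produces fewer but longer T-units, which is the property the measure is meant to capture.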

One question of interest is whether the Words per T-Unit scores on
the Hunt task will indeed measure foreign students' command of English
structure effectively and validly. Another question is how effectively
these scores can be estimated if multiple-choice approximations to the
Hunt task are used.

Cloze procedure task. Variations of the Cloze procedure task, in
which subjects are instructed to replace words deleted from prose passages,
have long been of interest in language testing (see, for example, Taylor,
1953, 1956; Carroll, Carton, and Wilds, 1959; and Oller, 1973). Renewed
interest in using the measure for foreign language testing was stimulated
by Darnell (1970), when he developed a scoring procedure (Clozentropy)
based on the frequency with which a large sample of American college
students gave various substitutions for a specific omitted word. Strong
interest in evaluating this type of measure for possible use in TOEFL has
been expressed by some TOEFL Committee members and by some ETS staff
working with the TOEFL program.

Questions regarding Cloze measures that guided the present study
included the following: What are the advantages and disadvantages associated
with different scoring methods, Clozentropy or Standard Cloze (accepting
original word substitutions only)? Using written essay scores as criteria,
how valid are the Cloze scores? Using both essay and Cloze scores as
criteria, how valid are various multiple-choice approximations to the
Cloze measures?
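
The contrast between the two scoring approaches can be sketched as follows. Standard Cloze scoring is shown as described above (credit only for the original word). The second function is a deliberately simplified frequency-weighted score: it credits a response in proportion to how often a reference group supplied the same word. It is an assumption made for illustration and is not Darnell's actual Clozentropy formula; all names and data are hypothetical.

```python
from collections import Counter


def normalize(word):
    """Lowercase a response and trim surrounding whitespace before comparison."""
    return word.strip().lower()


def standard_cloze_score(responses, original_words):
    """Count the blanks filled with exactly the original deleted word."""
    if len(responses) != len(original_words):
        raise ValueError("one response is required per blank")
    return sum(
        1 for r, o in zip(responses, original_words) if normalize(r) == normalize(o)
    )


def frequency_weighted_score(responses, reference_responses):
    """Simplified stand-in for frequency-based scoring (not Darnell's formula).

    `reference_responses[i]` holds the words a reference group supplied
    for blank i; a response earns the proportion of that group giving it.
    """
    if len(responses) != len(reference_responses):
        raise ValueError("one reference list is required per blank")
    total = 0.0
    for response, reference in zip(responses, reference_responses):
        if not reference:
            raise ValueError("each blank needs at least one reference response")
        counts = Counter(normalize(word) for word in reference)
        total += counts[normalize(response)] / len(reference)
    return total


# Hypothetical two-blank passage (invented for illustration).
original_words = ["house", "was"]
reference_group = [
    ["house", "house", "home", "building"],
    ["was", "was", "was", "is"],
]
candidate = ["home", "was"]

print(standard_cloze_score(candidate, original_words))
print(frequency_weighted_score(candidate, reference_group))
```

In this sketch an acceptable synonym earns nothing under Standard Cloze scoring but receives partial credit under the frequency-weighted score, which is one of the trade-offs the first question above is concerned with.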

Because of various practical limitations, differences associated with
the method of word deletion were not investigated. The procedure of
deleting every tenth word from the Cloze passages was followed; alternative
procedures, such as the systematic deletion of nouns, adjectives, or
function words, were not used. For the present study, an advantage of
the nth-word deletion method was that it did not require the imposing of
a priori constraints on the kinds of words to be tested. It provided,
instead, a random-like sampling of words and of their immediate contexts.
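
The nth-word deletion procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions: words are taken to be whitespace-separated tokens, so punctuation attached to a deleted word is removed along with it, and counting starts at the first word of the passage. The function name, the blank format, and the sample sentence are hypothetical.

```python
def make_cloze(passage, n=10, blank="_____"):
    """Replace every nth word of `passage` with a numbered blank.

    Returns the mutilated text and the list of deleted words (the key).
    Assumption: tokens are split on whitespace only, so any punctuation
    attached to a deleted word disappears with it.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    deleted = []
    output = []
    for position, word in enumerate(passage.split(), start=1):
        if position % n == 0:
            deleted.append(word)
            output.append(f"({len(deleted)}) {blank}")
        else:
            output.append(word)
    return " ".join(output), deleted


# Hypothetical short passage; n=3 is used only to keep the example small.
text, key = make_cloze("one two three four five six", n=3)
print(text)
print(key)
```

Because the deletion rule depends only on word position, no decision about which kinds of words to test is needed in advance, which matches the advantage noted above.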
Alternative Multiple-Choice Subtests 

The development or selection of four alternative multiple-choice
subtests that might fit into a future TOEFL was based on the questions and
observations regarding content specifications that were noted earlier.
These alternative subtests were administered as sections of an
"Experimental TOEFL" and will be discussed in their order of appearance in that
instrument.

Experimental Section X1, Sentence Comprehension. This section
consists of test items taken from subsection T1a (Sentences) of Listening
Comprehension sections of retired TOEFL forms. The only change was to
present the questions or statements, as well as the answer choices, in
the written mode.

This section served two purposes. One was to provide a partial
check on whether the difficulty of Listening Comprehension items is indeed
in the listening component of the task. If this is true, a test made up
of equivalent items, but presented entirely in written form, should be
much easier than the listening-based item statistics would suggest. The
second purpose was to measure reading comprehension at the sentence
level. If the sentence is the basic meaningful unit of connected prose,
then a sentence comprehension measure may very closely approximate the
less wieldy Reading Comprehension measure described above. The questions
concerning Experimental Section X1 are directly related to these two
purposes.
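
The first purpose amounts to comparing item difficulty across the two modes of presentation. The sketch below assumes difficulty is expressed as the proportion of candidates answering an item correctly, a common convention that the text above does not specify; the item labels and response data are hypothetical.

```python
def item_difficulty(item_responses):
    """Proportion of candidates answering an item correctly (1 = right, 0 = wrong)."""
    if not item_responses:
        raise ValueError("at least one response is required")
    return sum(item_responses) / len(item_responses)


# Hypothetical scored responses for the same two items in each mode.
spoken = {"item_1": [1, 0, 0, 1, 0], "item_2": [0, 0, 1, 0, 0]}
written = {"item_1": [1, 1, 1, 1, 0], "item_2": [1, 0, 1, 1, 1]}

# A positive gain means the item was easier when presented in writing.
gains = {
    item: item_difficulty(written[item]) - item_difficulty(spoken[item])
    for item in spoken
}
for item, gain in gains.items():
    print(f"{item}: gain in proportion correct = {gain:.2f}")
```

Large positive gains across items would be consistent with the difficulty lying in the listening component; gains near zero would suggest that the difficulty lies elsewhere in the items.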

Experimental Section X2, Words in Context. Each item in this section
consists of a complete sentence, with a target word or phrase underlined.
The answer choices are alternative words or phrases for the underlined part
of the sentence. For each sentence, the subject is instructed to "...find
the one choice that will best replace the underlined part of the sentence,
so that the basic meaning of the sentence remains the same."

This item format was used in part to meet the criticism that
vocabulary items foster in foreign students an undue emphasis on vocabulary
study of the kind required in preparing for a test of synonyms. By a test
of vocabulary (words) in context, the emphasis is shifted to language study
involving natural message units (sentences).

The format of Words in Context may have greater face validity than
either of the formats in the TOEFL Vocabulary section (T3a, Sentence
Completion, and T3b, Synonyms), because the task is like that often
confronting any reader, in which he has both a word in question and its
context to help him understand its meaning.

In writing the Words in Context items, an effort was made to use the
kinds of words and contexts a student would be likely to encounter. This
should further increase the face validity of the subtests and meet the
criticism that vocabulary items too often test words the foreign student
should not be expected to know.

Further comment regarding the difference between the
Vocabulary-Sentence Completion and the Words in Context formats may be helpful.
Although the two look much alike, and may or may not yield scores with
similar characteristics, they are logically quite distinct. The Words in
Context item presents the candidate with a complete message unit. The
distractors are words or sets of words which, when substituted for the
underlined part of the sentence, yield a different message, but not
necessarily an incorrect or anomalous sentence. The Vocabulary-Sentence
Completion item, on the other hand, presents the candidate with an incomplete
sentence and, thus, an incomplete message unit. In order not to be "keyable,"
each distractor must yield an incorrect or anomalous result when it is
used to complete the sentence.

Experimental Section X3, Combining Sentences. This task was
developed to provide a multiple-choice sentence-embedding task that might
approximate Hunt's rewriting task. The stem of each item consists of
three to five short sentences. Each answer choice combines the short
sentences in a different way. The subject's task is to "...choose the one
long sentence that is the best combination of short sentences."

The main question in connection with the Combining Sentences task is, 
of course, how well scores on it estimate scores generated by the use of 
Hunt's rewriting task. Also of interest are how well the measure performs 
against essay writing criteria, and whether its pattern of correlations 
with other measures is logically satisfying. 

Experimental Section X4, Paragraph Completion. This section consists
of two multiple-choice variations of the Cloze task. Each task includes a
reading passage with some of the words omitted and replaced by a numbered
blank. On a facing page, a set of numbers corresponding to the numbered
blanks is given, with each number followed by four words, one of which is
the word originally fitting the numbered blank.

Questions concerning this item type are: How well do Paragraph
Completion scores estimate Cloze scores? How well does the measure perform
against essay writing criteria?

Related Questions 

To reduce the complexity of presentation, certain general questions 
have been held for this part of the discussion of the purpose and back- 
ground of the study. 

For virtually all of the questions that have been discussed in this
section the answers may vary, depending on the language background of the
candidates. Thus, a first general question may be asked: To what degree
do the findings for candidates having an Indo-European first language
hold for candidates from a non-Indo-European background?

A logical case has been presented for differentiated testing of
English language skills. However, it is entirely possible that, for the
great majority of candidates, the development of certain component skills
may be so similar that there is no practical utility in having separate
test sections for each. Thus, a second general question may be asked:
How independent, in fact, are the component skills of English as a second
language? This, in turn, has direct implications for a third general
question: "How many separate scores should be reported in TOEFL?" The
second and third general questions may be refined to be answered separately
for subjects having different language backgrounds.
PLAN AND PROCEDURE

The plan of the study called for the administration of a battery of
measures including TOEFL, alternative multiple-choice and open-ended
measures of English as a foreign language, and direct tests of speaking and
writing performance in English. The tests, the subjects, and the
procedures for administering and scoring the measures are described in this
section of the report. The results, conclusions, and a discussion of the
implications of the study are described in subsequent sections.

Testing Materials

Each of the five TOEFL and four Experimental TOEFL subtests is an 
objective measure using a multiple-choice format. As a multiple-choice 
measure, each can be readily employed in a large-scale testing program 
such as TOEFL, but none can provide a direct estimation of a candidate's 
performance in the productive areas of English as a second language. 
Multiple-Choice Measures Used in TOEFL 

Section T1, Listening Comprehension. This section has three parts:
Sentences, Dialogues, and Lecture.

There are two kinds of tasks in T1a, Sentences. One kind is
answering a short question; the other is understanding a short statement.
Each question or statement is presented in the spoken mode; the answer
or paraphrase choices are given in written form.

Example I.    When did Tom come here?

              (A) By taxi.
              (B) Yes, he did.
              (C) To study history.
              (D) Last night.

              Sample Answer: (D)

Example II.   John dropped the letter in the mailbox.

              (A) John sent the letter.
              (B) John opened the letter.
              (C) John lost the letter.
              (D) John destroyed the letter.

              Sample Answer: (A)

In T1b, Dialogues, the candidate hears a series of short
conversations between two speakers. At the end of each conversation, a third voice
asks a question about what has been said. The four possible answers to
each question are given in written form.

Example III.  (man)          Hello, Mary. This is Mr. Smith at the office.
                             Is Bill feeling any better today?

              (woman)        Oh, yes, Mr. Smith. He's feeling much better
                             now. But the doctor says he'll have to stay
                             in bed until Monday.

              (third voice)  Where is Bill now?

              (A) At the office.
              (B) On his way to work.
              (C) Home in bed.
              (D) Away on vacation.

              Sample Answer: (C)

In T1c, Lecture, the candidate listens to a brief lecture, and is
instructed to take notes as he might if he were attending a university
lecture. A page is provided for his note-taking, at the top of which are
written several names and terms that occurred in the lecture, of the kind
a lecturer might write on the chalkboard in class.

At the end of the lecture, the candidate opens his test book to a set
of questions based on the lecture. He is allowed to use his notes while
answering the questions.

T2. English Structure. In this section each problem consists
of a short written conversation between two speakers, part of which has
been omitted. Four words or phrases are given beneath the conversation,
one of which will correctly complete it.

Example I.    "John needs a pencil."
              "He can use one ______."

              (A) of me
              (B) my
              (C) mine
              (D) of mine

              Sample Answer: (D)

Example II.   "Did you remember Mary's birthday?"
              "Yes, I ______."

              (A) her sent a gift
              (B) sent her a gift
              (C) to her a gift sent
              (D) a gift to her sent

              Sample Answer: (B)

T3. Vocabulary. This section has two parts, Sentence
Completion and Synonyms. Examples of T3a, Sentence Completion items, are
the following:

Example I.    A ______ is used to eat with.

              (A) plow
              (B) fork
              (C) hammer
              (D) needle

              Sample Answer: (B)

Example II.   To escape is to get ______.

              (A) away
              (B) down
              (C) up
              (D) over

              Sample Answer: (A)

Examples of T3b, Synonyms, are the following:

Example III.  foolish

              (A) clever
              (B) mild
              (C) silly
              (D) frank

              Sample Answer: (C)

Example IV.   a large branch of a tree

              (A) straw
              (B) limb
              (C) bean
              (D) vine

              Sample Answer: (B)
T4. Reading Comprehension. In this section, the candidate is
given a series of paragraphs to read, each followed by several questions
about what it means.

Sample paragraph.  The White House, the official home of the
                   President of the United States, was designed by
                   the architect James Hoban, who is said to have
                   been influenced by the design of a palace in
                   Ireland. The building was begun in 1792 and was
                   first occupied by President and Mrs. John Adams
                   in November 1800. The house received its present
                   name when it was painted white after being damaged
                   by fire in 1814.

I.   When was the White House first occupied?

     (A) 1776
     (B) 1792
     (C) 1800
     (D) 1814

     Sample Answer: (C)

II.  According to the paragraph, the President's house was first
     painted white when

     (A) President and Mrs. Adams requested that it be repainted
     (B) it was repaired following a fire
     (C) the architect suggested the new color
     (D) it was remodeled to look like an Irish palace

     Sample Answer: (B)

T5, Writing Ability. There are two parts to this section. Each
problem in T5a, Error Recognition, consists of a sentence in which four
words or phrases are underlined, and marked (A), (B), (C), or (D). The
candidate is asked to identify the one underlined word or phrase that
would not be acceptable in standard written English.

Example I.   At first the old woman seemed unwilling
                  A
             to accept anything that was offered her
                  B                 C
             by my friends and I.
                  D

     Sample Answer: (D)

Example II.  After they had chose the books they
                  A
             wished to read, the instructor
                  B
             told them the principal points he
                  C
             wanted them to note.
                  D

     Sample Answer: (A)

In T5b, Sentence Completion, each problem consists of an incomplete
sentence. Four words or phrases, marked (A), (B), (C), or (D), are given
beneath the sentence. The candidate is to choose the word or phrase that
best completes the sentence.

Example III. Because he had little education, his
             knowledge of the subject was ______.

     (A) limited
     (B) small in quantity
     (C) minor
     (D) not large at all

     Sample Answer: (A)

Example IV.  At 7:00 tonight, a public lecture on
             nuclear physics will be delivered in
             the University auditorium by a ______.

     (A) real informed man
     (B) very authoritative guy
     (C) prominent scientist
     (D) person who knows a lot about it

     Sample Answer: (C)

Alternative Multiple-Choice Measures 

The Experimental TOEFL subtests were developed at Educational Testing 
Service, specifically for the present study. A rationale for the inclusion 
of each experimental subtest was provided in the Purpose and Background 
section of this report. A description of the item format for each subtest 
follows.






Experimental Section X1, Sentence Comprehension. This subtest parallels
Part A of the TOEFL Listening Comprehension section. However, the
questions or statements, as well as the answer choices (options), are
presented in written form. The examples shown above for the TOEFL
Listening Comprehension section, T1a, apply as well to Experimental
TOEFL, Section X1.

Experimental Section X2, Words in Context. Each sentence in this
section has a word or phrase underlined. Four choices are given beneath
the sentence. The candidate is to select the option that will best
replace the underlined part of the sentence, so that the basic meaning of
the sentence remains the same.

Example I.   He discovered a new route through
                ----------
             the mountains.

     (A) wanted
     (B) found
     (C) traveled
     (D) captured

Example II.  Their success came about as a
                                      ----
             result of your assistance.
             ---------

     (A) according to
     (B) before
     (C) because of
     (D) during

Experimental Section X3, Combining Sentences. Each item in this
section consists of a group of short, related sentences. Four long
sentences are given below the group of short sentences. For every item,
each wrong option presents a message which differs from that conveyed by
the short sentences in the stem, although it does not necessarily depart
from standard usage. The candidate's task is to choose the option that
is the best combination of the short sentences.

Example I.   John is in the store. It is a hardware
             store. Fred is also in the store. They
             are buying tools.

     (A) John is buying tools from Fred in the hardware store.
     (B) John is buying hardware tools from Fred in the store.
     (C) John and Fred are buying tools in the hardware store.
     (D) John and Fred are buying hardware tools in the store.

Example II.  There was an accident. A car
             went off the road. A young
             man drove it. The car belonged
             to his father.

             A young man . . .

     (A) accidentally drove his father's car off the road.
     (B) accidentally drove the car off his father's road.
     (C) drove his father's accidental car off the road.
     (D) drove the car off his father's accidental road.



Experimental Section X4, Paragraph Completion. This section is made
up of two reading passages, each with some words omitted and replaced by a
numbered blank. On a facing page, a set of numbers corresponding to the
numbered blanks is given, with each number followed by four words. For
each numbered blank, the candidate is to choose the word that best fits
the context.

Examples I and II. For good reason, historians use the __I__ of writing
                   to mark the divide between history __II__ prehistory.

     I.  (A) job               II.  (A) in
         (B) effort                 (B) or
         (C) decision               (C) and
         (D) invention              (D) from



Open-ended Objective Measures

It is sometimes assumed, incorrectly, that objective measures are
necessarily cast in the multiple-choice item format. Two interesting
exceptions to such a proposition are the Hunt rewriting task and the Cloze
procedure. Both tasks impose more constraint than a free-response essay
assignment would, but the responses called for are, nevertheless,
open-ended rather than multiple-choice, and they are objectively scorable.
As measures of writing ability, the tasks are more direct than
multiple-choice measures, but less direct than an essay-writing
assignment. As might be expected, they are also intermediate with respect
to scoring costs.

Hunt's Aluminum passage. With the permission of Kellogg Hunt, the
Aluminum passage reported in his study (1970b) was used in the present
study. The same directions, translated into Spanish and Japanese for
the two subject groups, were also used. The directions in English read
as follows:

     Directions: Read the passage all the way through. You
     will notice that the sentences are short and choppy. Study
     the passage, and then rewrite it in a better way. You may
     combine sentences, change the order of words, and omit words
     that are repeated too many times. But try not to leave out
     any of the information.

The passage presented to the subjects consisted of 32 very short
sentences of connected discourse. The first portion is as follows:

     Aluminum is a metal. It is abundant. It has many uses. It
     comes from bauxite. Bauxite is an ore. Bauxite looks like
     clay.

Subjects combined these sentences with varying amounts of embedding, and
with varying degrees of success in retaining the original units of
information. All six of the above sentences could be embedded into a
single sentence, yielding something like the following: Aluminum, an
abundant metal with many uses, comes from bauxite, a clay-like ore.

Hunt's principal measure of "syntactic maturity," Words per T-Unit,
was also adopted for use in the present study. Hunt (1970b) defines the
T-unit, or "minimal terminable unit," as ". . . one main clause plus any
subordinate clause or nonclausal structure that is attached to or embedded
in it . . . . So cutting a passage into T-units will be cutting it
into the shortest units which it is grammatically allowable to punctuate
as sentences [p. 4]." Largely an objective measure, the Words per T-Unit
measure requires some initial judgment, but once T-unit boundaries have
been agreed upon, it amounts basically to a word count. Furthermore,
very high agreement on assigned T-unit boundaries can be reached by
trained readers judging the rewritten passages.
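
Once the T-unit boundaries have been marked, the Words per T-Unit value reduces to simple arithmetic. The following is a minimal sketch of that arithmetic, not the scoring procedure actually used in the study. The function name `words_per_t_unit` is hypothetical, and counting words by splitting on whitespace (so that "clay-like" counts as one word) is an assumption.

```python
def words_per_t_unit(t_units):
    """Mean number of words per T-unit for one rewritten protocol.

    `t_units` is a list of strings, one per T-unit. The boundaries are
    assumed to have been marked already by a trained reader; that
    judgment is not automated here.
    """
    if not t_units:
        raise ValueError("protocol contains no T-units")
    # Assumption: a "word" is any whitespace-delimited token.
    total_words = sum(len(unit.split()) for unit in t_units)
    return total_words / len(t_units)


# The six kernel sentences from the Aluminum passage, one T-unit each.
kernels = [
    "Aluminum is a metal.",
    "It is abundant.",
    "It has many uses.",
    "It comes from bauxite.",
    "Bauxite is an ore.",
    "Bauxite looks like clay.",
]

# The single-sentence rewriting: one T-unit.
combined = [
    "Aluminum, an abundant metal with many uses, "
    "comes from bauxite, a clay-like ore."
]

print(words_per_t_unit(kernels))   # 23 words over 6 T-units
print(words_per_t_unit(combined))  # 13 words over 1 T-unit
```

Under these assumptions the unrewritten kernels average fewer than four words per T-unit, while the fully embedded rewriting yields thirteen, which is the direction of change the measure is meant to capture.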

Prior to the actual scoring of Words per T-Unit, Hunt's procedure
deleted extraneous, unintelligible, or inaccurate passages. When these
were found, the entire sentence involved was deleted. In the present
study extraneous sentences were also eliminated, but faulty sentences
were retained.

An additional score, "High K's," was generated for this study, to
indicate the number of short kernel sentences (K's) that were adequately
represented in each subject's rewriting of the Aluminum passage. In
contrast to the Words per T-Unit score, the High K's score is definitely
subjective and as such requires a considerable amount of time and effort
to reach satisfactory levels of agreement in judgment.

Cloze passages. Two prose passages, each about 280 words in length,
were used as the basis for the Cloze procedure tasks. The first sentence
in each passage was left intact to provide an adequate context to begin
the task. In the remainder of the passage every tenth word was replaced
by a blank, yielding 25 blanks for each passage.
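
The deletion rule just described can be sketched in a few lines. This is an illustration only: the function name `make_cloze`, the naive first-sentence detection, and the fixed-width blank are all assumptions, and the passages in the study were presumably prepared by hand.

```python
import re


def make_cloze(passage, nth=10, blank="______"):
    """Return (cloze_text, deleted_words) for one passage.

    The first sentence is left intact; in the remainder every `nth`
    word is replaced by `blank`.
    """
    # Naive assumption: the first sentence ends at the first ., ! or ?
    # that is followed by whitespace or the end of the text.
    match = re.match(r"\s*.+?[.!?](?=\s|$)", passage, flags=re.S)
    first = match.group(0) if match else ""
    rest_words = passage[len(first):].split()

    deleted = []
    for i in range(nth - 1, len(rest_words), nth):
        # Punctuation attached to a word is deleted with it here;
        # a real test form would keep the punctuation visible.
        deleted.append(rest_words[i])
        rest_words[i] = blank

    cloze = (first.strip() + " " + " ".join(rest_words)).strip()
    return cloze, deleted
```

With this rule, a passage of about 280 words whose opening sentence runs to about 30 words leaves about 250 words, which gives the 25 blanks reported for each passage.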

The original instructions for the Cloze tasks are shown below.
These were translated into Spanish or Japanese with the exception of
the italicized words and the handwritten answers, which were retained
in English.

     Instructions: This section contains two reading passages with
     some words omitted. Fill in the one word that you think best
     completes each blank in the passage below.

     EXAMPLE

          John came to school ______ bus. He thought it
          was the ______ way.

     You would probably write by in the first blank.
     You might write safest or quickest or
     cheapest in the second blank. Use any word which
     seems to be correct to you.

     Thus, you might complete the two sentences:

          John came to school by bus. He thought it
          was the safest way.

     We suggest that you scan an entire passage, go over it again
     filling in the easy blanks, then go back a third time filling
     in the difficult ones.

Subjective Measures 

The English language productive skills, speaking and writing, are 
fundamental to the academic success of the foreign student. The language 
proficiency interview and the essay writing tasks developed for this study 
required actual performance in these skills. These subjective measures 
were designed to provide direct and reliable, as well as feasible, 
estimates of the speaking and writing abilities needed in the academic 
setting. Although they were somewhat removed from the speaking and 
writing tasks encountered in the classroom, the measures had the 
advantages of assigning the same tasks to all subjects and of providing 
scores not influenced by the extraneous factors influencing a criterion 
such as teacher-assigned grades.

Interviews. Performance in English language conversation was
measured by means of structured, tape-recorded interviews which were
conducted at the testing sites. The performance judgments were
subsequently made by staff members at Educational Testing Service. The
judgment scales were Accent, Grammar, Vocabulary, Fluency, and Overall
Communication. Scores for the latter three scales were derived
separately for the general or narrative part of the interview and for
the academic or technical part.

The procedure for conducting and evaluating the interviews was
adapted from the Peace Corps Language Proficiency Interview, which in
turn derives in part from the Absolute Language Proficiency Rating
prepared by the Foreign Service Institute for the classification of
officers of the United States Department of State (Rice, 1959; Wilds,
undated).

The interviews were typically 20 to 30 minutes in length. In Peru
and Chile, they included four stages. The first was an exploratory stage
of about 2 to 4 minutes, designed to put the interviewee at ease. Second,
there was a narration task of about 5 minutes' duration, in which the
candidate was asked to relate in English a story from a Spanish-language
comic book. (He was allowed to select one of eight such stories beforehand
and was given 10 minutes' preparation time immediately before the
interview. He could look at the comic book while telling the story, but
was advised to use the pictures only, because an attempt to translate
rather than narrate would be likely to hinder, rather than facilitate,
his narration.) Third was a 4 to 6 minute paraphrasing task in which a
graded series of increasingly complex sentences was read to the candidate.
After each sentence was read, the subject's task was to paraphrase it.
The fourth was an analytical stage, 6 to 10 minutes long. In order to
provide an indication of the student's English language skills that would
be appropriate in an academic setting, the conversation was directed
toward the student's plans and major areas of study, the ideas and writers
of interest in his or her field, etc. The second and third parts of this
interview are innovations not found in the FSI and Peace Corps interviews.

Interviews in Japan followed the same pattern, except that the
second stage was a period of general conversation, rather than the
narrative task.

For the narrative or general part and for the analytic part of the
interview, interviewers were instructed to try to get a good
representative sample of the candidate's capacity in specific language
areas. Thus, the interview was to give the student the opportunity to
show his knowledge of verb forms, person-subject-object agreement, the
formation and use of adjectives and adverbs, and other aspects of grammar.
Similarly, it was to be conducted so that it would provide a good sample
of the candidate's general vocabulary and academic or technical
vocabulary. Interviewers were informed that the candidate would also be
judged on accent, fluency, and overall communication, but that
interviewers did not need to try in any way to help the student
demonstrate his performance in these areas.

Details regarding the administration and subsequent scoring of the
tape-recorded interviews will be given below.

Essays. Four essay tasks were assigned, with 10 minutes provided
for each.* The assignment of several short essays rather than one or two
longer ones was based on evidence (e.g., Godshalk, Swineford, and Coffman,
1966) that essay ratings vary significantly from topic to topic. Two of
the essays used pictures for stimuli; the other two did not.

Instructions written in Spanish or Japanese were provided for each
of these tasks.

* These tasks, based on materials developed by John Carroll and an
international committee, are described in the document, International
Association for the Evaluation of Educational Achievement, Phase II,
Stage 3, French as a Foreign Language, June 1970. Permission to use
these materials was received from the I.E.A. Bureau, communicated
through T. Neville Postlethwaite, Executive Director, I.E.A.,
Wenner-Gren Center, Stockholm, Sweden.

In the first essay task, the subject was presented a sequence of
three pictures in comic-book format and was instructed to describe (in
English) what is happening in the set of pictures. The second task called
for writing a dialogue between two boys and incorporating certain words or
expressions listed in the directions, such as "beautiful day," "to take a
walk," and "bicycle." The third presented a single picture, in which a
boy's bicycle has just been damaged by an automobile. The candidate was
instructed to describe what he thought led up to the event, what is
happening in the picture, and what will happen next. The fourth exercise
called for a composition comparing the advantages of living in the country
and in a large city. The candidate was provided certain terms, such as
"peacefulness" and "department stores," to be included in the essay.

Subjects 

Participants in the study included 98 Peruvians, 145 Chileans, and
199 Japanese, to whom Form TEF4 of TOEFL was administered in Lima,
Santiago, and Tokyo, respectively.

Several considerations led to the inclusion of subjects from the two
language backgrounds, Spanish and Japanese. If two very different first
languages are represented, one Indo-European and the other not, the
findings can be more generally interpreted than if a single language or
only closely related languages are represented. Limiting the backgrounds
to two made it possible to have enough subjects from each language group
to allow meaningful and useful analyses of complex questions. The Spanish
and Japanese backgrounds were chosen because they meet the criterion of
being distinctly different from one another, and because a high volume
of TOEFL candidates comes from each. The latter fact made it possible for
data to be gathered in a few central locations from a sufficiently large
number of candidates. It also meant that the findings specific to the
respective language backgrounds would be directly relevant to a large
number of people.

The decision to administer TOEFL and the experimental research measures
to candidates in their own countries was based on the following
considerations. First, the students' performance in English as a second
language would not have been influenced by varying amounts of formal and
informal exposure to English in the United States. Second, it was much
easier to test the required number of subjects in a few locations.
Finally, testing overseas avoided the problem of restriction of score
range inherent in testing foreign students already accepted for study on
an American campus, whose admission was usually based in part on TOEFL
scores.

An Individual Background Questionnaire was developed and translated
into Spanish and Japanese. All participants completed the questionnaire,
which asked their age, sex, parents' education, major area of study, level
of education completed, recent school grades, and the source and amount of
formal English language instruction and informal exposure to English. They
were also asked to estimate their general level of competence in reading,
writing, listening, and speaking English. The questionnaire responses are
discussed in the Results and Conclusions section of this report.

Administration of Measures

The recruitment of subjects and the administration of all measures
except TOEFL were carried out by collaborators and their support staffs at
the Instituto Peruano de Fomento Educativo (IPFE) in Lima, Peru, at the
Universidad de Chile in Santiago, Chile, and at Language Education
Associates, Incorporated, and the International Education Center, both
in Tokyo, Japan.

Recruiting was carried out primarily by contacting students who had
applied to take the October 1971 administration of TOEFL and who would
take that test in one of the cities noted above.

Instrumentation, as used in the study, was grouped into the following
units:

(1) TOEFL, Form TEF4, with the five sections described earlier in
    this part of the report. Total testing time, 2 hours, 20 minutes.

(2) An Experimental Test of English as a Foreign Language, with the
    four sections also described earlier. Testing time, 2 hours,
    25 minutes.

(3) An Experimental Test of English Writing Ability. (There were
    Spanish and Japanese versions of this instrument, with directions
    in the appropriate language.) This test contained the Hunt
    rewriting task, the Cloze procedure passages, and the four essay
    assignments. Testing time, 2 hours.

(4) Individual Background Questionnaire, Spanish and Japanese
    versions. Completion time, about 15 minutes.

(5) A structured English proficiency interview. Interview time,
    20 to 30 minutes.

In Peru, all participants in the study took TOEFL at the regular
October 1971 administration of that examination. In Chile and Japan, about
half of the participants took TOEFL at the regular administration, and half
took it at a special administration under standard conditions. All
remaining instruments were administered at the institutions noted above.
The written tests were given within three days after TOEFL was
administered. These were followed by the interviews and background
questionnaires, most of which were completed within three weeks.

All interviewers were native speakers of English. In Peru and Japan,
most of the interviewers were recruited from American embassy personnel
or members of their families. They were trained by persons experienced in
administering the Peace Corps Language Proficiency Interview. Each began
with practice interviews, following the procedures developed for the
present study. In Chile, the interviews were conducted by Professors Ewer
and Hughes-Davies and three of their staff members. All were experienced
in the use of interviews to assess the English language skills of Chilean
students.

Scoring Procedures

Multiple-choice Measures

The five TOEFL and four Experimental TOEFL measures were scored
according to the standard procedures applied to multiple-choice tests.
Because "rights only" scoring is used for TOEFL, the same procedure was
applied to the Experimental TOEFL. (The fact that there is no penalty for
guessing was indicated in the candidates' instructions for both
examinations.)

Open-ended Objective Measures

Hunt's Aluminum passage. Score sheets and instructions used to obtain
the Words per T-Unit and High K's values are shown in Appendix A. For each
candidate's rewriting or protocol, the Words per T-Unit value was obtained
by first striking out extraneous sentences, then marking the T-unit
boundaries, then counting the number of words and T-units and computing
the ratio of Words/T. The adequacy with which the information in each
kernel sentence or K was expressed in a given protocol was judged as high,
medium, low, or absent, using the instructions shown in Appendix A. The
High K's score used in subsequent analyses was the total number of the 32
K's judged to have been effectively expressed in the protocol. The High K
rating was determined by whether the K in question was clearly and
unambiguously stated; it did not necessarily have to be stated in standard
English.

The Hunt protocols obtained for the study were scored for Words per
T-Unit and for High K's by two scorers, who first practiced with protocols
obtained through pretesting. Approximately one-third of the protocols for
the study were rated independently by both scorers. This provided a
basis for estimating scorer reliability, and for the scorers and the author
to meet periodically to consider differences in scoring and subsequently
to refine scoring procedures. The other two-thirds of the protocols
received only a single rating.

Protocols were randomly assigned to batches for scoring, and were
randomly ordered within each batch. Those in batches scored by both
scorers were assigned two random orderings, one for each judge. These
randomization procedures were used to prevent any systematic order effect,
such as might occur through shifting standards over a period of making
judgments.

Cloze passages. As a preliminary step for Clozentropy scoring, the
Cloze passages were administered to 260 American college students. Their
responses to each blank were tallied, and a "dictionary" was made up for
that blank, listing the response words and the frequency with which they
occurred. For example, one of the blanks appeared in the second Cloze
passage, as follows:

     But, as ______ morning advanced, the strength of
     the gusts diminished.

The dictionary for this blank, based on American students' responses,
was:

     Word     Frequency     Word     Frequency     Word      Frequency

     each          1        late          3        Soon as        1
     early        21        mid           1        the          230
     first         1        new           1        usual          1

Using Reilly's (1971) simplification of Darnell's (1970) Clozentropy 
scoring procedure, a foreign student's response to a given blank was 
scored as follows. If he did not fill in the blank, or if his response
was a word not listed in the dictionary for that blank, he was given a 
score of zero. If his response was one of the words in the dictionary, 
his score for that blank was the logarithm of the frequency associated 
with his response word. 

In the Standard Cloze scoring procedure, the foreign student's re- 
sponse was compared with the word that had been omitted. If his response 
matched the original word for a given blank, the student was given a score 
of 1; if it did not match, his score was zero. 
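
The two scoring rules can be sketched as follows, using the dictionary for the blank shown above. This is an illustration under stated assumptions, not the program used in the study: the function names are hypothetical, the report does not give the base of the logarithm (natural log is assumed here), and folding case before matching is also an assumption.

```python
import math

# Response dictionary for the example blank (260 American students).
# Keys are lower-cased here so that matching ignores capitalization.
DICTIONARY = {
    "each": 1, "early": 21, "first": 1,
    "late": 3, "mid": 1, "new": 1,
    "soon as": 1, "the": 230, "usual": 1,
}


def clozentropy_score(response, dictionary):
    """Simplified Clozentropy score for one blank.

    Blank or unlisted responses score zero; otherwise the score is the
    logarithm of the frequency of the response word. Natural log is an
    assumption; the report does not state the base.
    """
    if not response:
        return 0.0
    freq = dictionary.get(response.strip().lower(), 0)
    return math.log(freq) if freq > 0 else 0.0


def standard_cloze_score(response, original_word):
    """Standard Cloze score: 1 for the deleted word itself, else 0."""
    if not response:
        return 0
    return int(response.strip().lower() == original_word.strip().lower())


print(clozentropy_score("the", DICTIONARY))    # log(230)
print(clozentropy_score("early", DICTIONARY))  # log(21)
print(clozentropy_score("windy", DICTIONARY))  # unlisted word
```

One consequence of the rule as stated is worth noting: a word supplied by only one American student has frequency 1, and the logarithm of 1 is zero in any base, so such a response earns the same score as an unlisted word.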

Both scoring procedures were carried out by computer, and were not
unreasonably expensive. The expensive and time-consuming part of
obtaining scores for Cloze responses lay in the work of preparing them for
keypunching, and in the keypunching itself.

Subjective Measures

Interviews . Each candidate's proficiency in spoken English was rated 
independently by three listeners who used the interview rating sheet and 
the proficiency descriptions shown in Parts 1 and 2 of Appendix B. 

In preparing each interview tape for the listeners, the author or a 
research assistant listened to the complete tape, and then marked two seg- 
ments to be used in assigning the proficiency ratings. One 4 or 5 minute 
segment was selected from what appeared to be the candidate's best perfor- 
mance on the narrative or general conversation part of the interview, and 
the other from the best portion of the academic or technical part. This 
procedure reduced by two-thirds the listening time needed to assign ratings, 
while ensuring that the three sets of ratings for each candidate were still 
based on the same language samples. 

There were ten listeners, including the author. He and two research 
assistants developed eight training tapes and eight practice tapes for the 
initial training of the other seven listeners. Each training tape was 
accompanied by an information sheet which provided the scores agreed upon 
by the three ETS staff members and comments illustrating what might be 
written in the lower half of the rating sheet regarding the bases for the
scores assigned. Four of these information sheets are shown in 
Appendix B. After working with the training tapes, all listeners rated 
the practice tapes independently. Group training sessions were held, in 
which differences in rating were discussed and resolved. 



The proficiency descriptions were designed to keep the interview
ratings on an absolute, criterion-referenced scale of English language
proficiency. In order to reduce the tendency to move to a normative
scale, or for the rating of a given interview to be unduly influenced by
the proficiency demonstrated in immediately preceding tapes, raters
regularly referred to these descriptions when scoring the interviews.

It was found that the listeners were more comfortable about assigning
ratings if between-level values could be used. The scale was
therefore expanded to include 16 levels: 1, 1+, 2-, 2, 2+, . . . , 5+,
6-, and 6. The six unmodified numbers are still anchored by the
proficiency descriptions.

In making their ratings, listeners were urged to keep the several
scales independent. It was pointed out, however, that the Communication
scale, with its emphasis on the ability to convey meaning or content
effectively, was bound to be influenced in varying degrees by the other
proficiencies that the listeners were rating.

For scoring purposes, the taped interviews were grouped into 20
batches of 20 tapes each, and a final batch of 22. In assigning tapes
to batches, proportional representation of the three nationalities was
imposed, so that each of the 20 batches had four or five tapes from Peru,
six or seven from Chile, and eight or nine from Japan. Then, for each
batch, the designated number of tapes for each nationality was randomly
selected from the set of tapes from the respective countries.
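
The batching step amounts to drawing a fixed quota of tapes per nationality, at random and without replacement, for each batch in turn. A minimal sketch follows. The function name `draw_batches`, the tape identifiers, and the pool sizes in the usage example are all hypothetical; the report gives the quota ranges per batch but not the number of tapes available from each country.

```python
import random


def draw_batches(pools, quotas, seed=None):
    """Draw tapes for each batch at random, without replacement.

    pools:  {country: [tape ids]} for the tapes available
    quotas: list of {country: count}, one dict per batch
    """
    rng = random.Random(seed)
    # Shuffle a private copy of each country's tapes once, then deal
    # from the end of the shuffled list.
    remaining = {c: rng.sample(ids, len(ids)) for c, ids in pools.items()}
    batches = []
    for quota in quotas:
        batch = []
        for country, count in quota.items():
            if count > len(remaining[country]):
                raise ValueError(f"not enough tapes left for {country}")
            batch.extend(remaining[country].pop() for _ in range(count))
        batches.append(batch)
    return batches


# Hypothetical pools, just large enough for two batches of 20.
pools = {
    "Peru": [f"P{i:02d}" for i in range(9)],
    "Chile": [f"C{i:02d}" for i in range(13)],
    "Japan": [f"J{i:02d}" for i in range(18)],
}
quotas = [
    {"Peru": 4, "Chile": 7, "Japan": 9},
    {"Peru": 5, "Chile": 6, "Japan": 9},
]
batches = draw_batches(pools, quotas, seed=1)
print([len(b) for b in batches])
```

Alternating the quotas between the lower and upper value for each country, as in the two example batches, is one way to keep every batch at exactly 20 tapes while staying within the ranges reported.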

Tapes were assigned to listeners in the batches of 20. Each batch
was listened to independently by three listeners. Whenever a batch was
assigned to a new listener, a new random list of numbers 1-20 was also
assigned, indicating the order in which the tapes were to be rated. The
final batch of tapes was used for continued training and to keep all
scoring coordinated. There were five coordinating sessions in which all
raters met and listened collectively to three to five tapes. Each rated
the tapes independently. After each tape was played and rated, the
ratings were tallied, the differences discussed, and new rating problems
and considerations were examined.

Essays. As noted above, the candidates wrote four short essays,
each on a different topic. Each essay was rated independently by two
or more readers on each of the following scales:

     E1a  Content (Quantity). The number of ideas and
          concepts expressed, the degree of elaboration,
          etc.

     E1b  Content (Quality). The adequacy and interest
          value of the story line, internal consistency,
          how well the story "comes across" and whether
          it is easy to follow.

     E2a  Form (Quantity). The range of vocabulary and
          of grammatical structures used.

     E2b  Form (Quality). The appropriateness and
          effectiveness of the vocabulary and structures
          used.

The essays were scored by a procedure using several readers for each
paper, each making rapid, holistic judgments regarding one of the four
scales described above. The procedure was derived from that developed
for the Godshalk et al. study, The Measurement of Writing Ability (1966),
and since refined in essay readings conducted as part of the College
Board's English Composition Test.

The 21 participating readers were all experienced with essay-grading
procedures like those used in this study. All had participated in other
ETS essay readings, in cooperation with the ETS Essay Reading Group.

Readers were organized into four groups or tables, each with a table
leader. Readers at one table read only for scale E1a, Content (Quantity),
those at another table read only for scale E1b, Content (Quality), and so
on. By this arrangement, readers at a given table could develop a "set"
for the kind of judgment they were to make, keeping it independent of
the other three kinds of judgment called for.

Prior to the readings, the table leaders were oriented by the chief 
reader, and each prepared in turn to orient the readers at his or her 
table. A 6-point normative-reference scale was introduced, in which 
level 2 represented the typical or median paper for the lower half of the 
essays, and level 5 the typical essay among the upper half. The score 
for a lower-half essay could be "shaded down" to level 1, or "shaded up" 
to level 3. Similarly, an upper-half essay could be graded 5, or shaded
to either 4 or 6. After being assigned to one of the four scales, each
table leader scanned a sample of Topic I essays, then selected essays 
that epitomized each of the six levels of writing proficiency. These were 
subsequently used to orient the table readers, and to serve as reference 
points for the six levels. The same procedure was followed in preparing 
to read and score essays on Topics II, III, and IV. 

All essays on Topic I were scored first, then those on Topic II, and
so on. While essays on a given topic were being scored, they were moved
in small batches from one table to another until each had been scored
independently by two readers at each table. As essays went through each
step of this sequence of eight readings, the order was scrambled to
prevent any order effect, and each successive score was covered up so that
the following ratings would not be influenced by those preceding.

Whenever the two ratings for a given essay on a given scale were
separated by two or more intervening categories (e.g., if the scores
were 1 and 4, or 2 and 5), a third independent rating was obtained.
In those cases, the third rating and the earlier one closest to it were
retained, and the other score dropped.
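
The discrepancy rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, "two or more intervening categories" is read as a difference of three or more scale points, and the handling of a tie (a third rating equally distant from both earlier ratings) is not described in the report, so keeping the first rating in that case is an assumption.

```python
from typing import Optional, Tuple


def needs_third_rating(first: int, second: int) -> bool:
    """True when two or more scale categories lie between the two ratings."""
    # Scores of 1 and 4 have categories 2 and 3 between them, so the
    # rule corresponds to an absolute difference of 3 or more.
    return abs(first - second) >= 3


def resolve_ratings(first: int, second: int,
                    third: Optional[int] = None) -> Tuple[int, int]:
    """Return the pair of ratings retained for one essay on one scale."""
    if not needs_third_rating(first, second):
        return (first, second)
    if third is None:
        raise ValueError("discrepant ratings require a third independent rating")
    # Keep the third rating plus whichever earlier rating lies closest to it.
    # Assumption: on a tie, the first rating is kept (not stated in the report).
    if abs(first - third) <= abs(second - third):
        closest = first
    else:
        closest = second
    return (closest, third)


print(resolve_ratings(2, 4))      # no third reading needed
print(resolve_ratings(1, 4, 3))   # third reading replaces the more distant score
```

In the second call the ratings 1 and 4 are discrepant, the third rating of 3 lies closer to 4, and the rating of 1 is dropped.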





RESULTS AND CONCLUSIONS 



To provide a background for discussing test results, the subject
groups will be described first, primarily from their individual responses
to the background questionnaire. Then summary test statistics from the
several multiple-choice measures, open-ended objective measures, and
subjective measures will be examined and compared, as well as the
interrelationships among them. Finally, conclusions regarding each of the
measures will be discussed in turn.

As noted earlier, participants in the study included 98 Peruvians,
145 Chileans, and 199 Japanese, all of whom were administered Form TEF4 of
TOEFL. Because of some attrition, respondents to the background
questionnaire numbered 86, 136, and 196, respectively. The smallest numbers of
candidates responding to test materials in addition to TOEFL TEF4 were 91
for Peru, 140 for Chile, and 187 for Japan. The N's for the separate
measures are given in the appropriate tables. In the data analyses,
missing data procedures were used where needed.

The substantial linguistic differences between the Latin American and
Japanese participants provide a basis for partitioning between findings
that are unique to those having either Spanish or Japanese language
backgrounds and those that hold for both languages and that may therefore
be more readily generalized to subjects having yet other language backgrounds.
Throughout the following discussion similarities and contrasts between
Spanish and Japanese data will be pointed out, as well as those between
data from the two Spanish language groups, Peruvians and Chileans.
Although these observations will be of interest for what they indicate
about the English language skills of the respective groups, they will be
focused primarily on implications regarding the present and alternative
TOEFL measures under investigation.



Responses to general background questions are reported by language
group in Table 1. In discussing these data, the differences between the
Peruvian and Chilean samples will be noted where appropriate.

The responses to question 3 show that less than one percent of the
sample were from families where English was the language used in the home.
Among Peruvians, the non-Spanish languages indicated were Chinese and
Quechua (Inca). Among Chileans, the non-Spanish language used in the
home was usually either German or Italian. Among the Japanese subjects,
the non-Japanese language used in the home was Chinese in four instances,
and Tagalog in the remaining one.

For the two language groups, 77 percent of the Spanish and 61 percent
of the Japanese indicated having completed three or more years of higher
education. Slightly over three-fourths of the Spanish subjects planned to
earn a doctorate, but this was true for only one-tenth of the Japanese.
This difference apparently reflects the difference in use of the doctoral
degree in the respective countries.

Regarding college majors, the natural sciences were favored by the
Spanish subjects, but not by the Japanese. Engineering as a major field
was also considerably more popular in the Spanish group than in the Japanese,
where higher percentages are shown in the humanities and social sciences.



Background Questionnaire Responses

Table 1

Responses to General Background Questions, by Percentages

                                        Language group
Background question                       Sp      Jpn

1. Age
     Less than 21                         15       18
     21-23                                31       40
     24-26                                22       19
     27 or older                          23       22
     No response                           9        0

2. Sex
     Male                                 70       60
     Female                               30       40

3. Language in the home
     Spanish                            92.3      0.0
     Japanese                            0.0     97.0
     English                             0.5      0.0
     Chinese                             1.4      2.0
     German                              2.3      0.0
     Italian                             1.4      0.0
     Other                               2.8      0.5

4. Education completed
     Secondary school or less             12       12
     1-2 years higher education            8       26
     3-4 years higher education           24       51
     1 or more years graduate school      53       10
     No response                           4        1

5. Educational objective
     Bachelors or less                     4       39
     Licentiate/Masters                   14       29
     Doctorate                            77       10
     Other                                 0        6
     No response                           6       16

6. Secondary school major
     College prep                         56       57
     Community/vocational                  3        6
     General                              29       36
     Other                                 1        1
     No response                          11        1

7. Higher education major
     Agriculture                           6        1
     Economics                            10       18
     Education                            10        2
     Engineering                          29       16
     Humanities                            1        2
     Language/Lit.                         5       24
     Natural Science                      12        4
     Social Science                        6       12
     Professional Services                 8        7
     No response                          12       15

Note: Tabulated responses are from questionnaires administered to 86
Peruvian, 136 Chilean, and 196 Japanese participants.

Self-reported grades in English as a second language and self-ratings
on English language skills are given in Tables 2 and 3. The self-reported
grades in English reading and writing were quite similar between the two
language groups, but the Japanese tended to rate themselves somewhat
lower than did the Spanish in English listening and speaking. In their
self-ratings on all four English language skills (newspaper reading, essay
writing, listening, and conversing), the Japanese tended to rate themselves
much lower than did the Latin American candidates.



Table 2

Self-Reported Grades in English as a Foreign Language

Level of            Reading       Writing      Listening     Speaking
grades in EFL       Sp   Jpn      Sp   Jpn     Sp   Jpn      Sp   Jpn

Excellent           32    35      30    30     29    14      28    18
Good                42    35      38    32     34    29      28    31
Fair                16    17      20    22     19    21      27    18
No response          8    13      11    16     17    36      17    33

Note: N = 222 Spanish-language (86 Peruvian and 136 Chilean), and
196 Japanese-language participants.

Although the questionnaire data have not been separated into Peruvian
and Chilean responses, taped interviews and other sources of information
indicate that the two groups are distinct in certain respects. Participants
in the Chilean sample account for most of the Spanish subjects already
in graduate school and aspiring to a doctorate. They were typically
middle-class or higher, and urban (from Santiago or its suburbs). By
contrast, the Peruvian subjects were typically from smaller towns or
cities some distance from Lima, and were lower-middle class or below.



Table 3

Self-Ratings on English Language Skills

                          Newspaper       Essay
Level of difficulty        reading       writing      Listening    Conversing
in stated activity         Sp   Jpn      Sp   Jpn     Sp   Jpn     Sp   Jpn

Easily                     40    15      44     5     29    10     29     5
With some difficulty       56    82      47    79     57    83     57    77
With much difficulty        2     3       8    16     12     6     13    17

Note: N = 222 Spanish-language and 196 Japanese-language participants.



Summary Test Statistics 
Summary test statistics will be presented for the three subject
groups for the multiple-choice measures found in TEF4 and in the
"Experimental TOEFL," the open-ended objective measures derived from Hunt's
aluminum passage and the two Cloze passages, and for the subjective
interview and essay measures.

Multiple-Choice Measures 

Means and standard deviations of multiple-choice scores for the
three subject groups are shown in Table 4. "World" data are also given
for the five major TOEFL scores, T1 through T5. The latter are from a
spaced sample of 1000 candidates who took the October 1971 TOEFL
examination, Form TEF4, as reported in ETS Statistical Report 71-112
(Swineford, 1971).

Table 4

Means and Standard Deviations of Multiple-Choice Measures

                                                     Mean                 Standard deviation
Multiple-choice measure       Items  Min.   Peru  Chile  Japan  World   Peru  Chile  Japan  World

Regular TOEFL sections
T1  Listening Comprehension      50    40   22.9   28.7   30.9   30.1   11.7   10.5    7.5    8.8
T2  English Structure            40    20   16.1   23.0   24.4   23.8    9.1    8.2    5.6    7.0
T3  Vocabulary                   40    15   19.4   24.1   20.3   22.4    7.3    5.8    6.5    6.9
T4  Reading Comprehension        30    40   14.0   19.2   17.6   16.5    6.3    4.8    4.2    5.0
T5  Writing Ability              40    25   13.8   19.7   21.0   21.6    7.6    7.4    5.7    6.7
(N)                                         (98)  (145)  (199) (1000)

Regular TOEFL subsections
Listening Comprehension
T1a Sentences                    20    10    8.3   11.1   12.1           5.6    4.7    3.4
T1b Dialogues                    15    12    7.0    8.5    9.2           3.7    3.7    2.6
T1c Lecture                      15    18    7.6    9.1    9.6           3.3    3.1    2.8
Vocabulary
T3a Sentence Completion          15    --    7.3    9.1    7.2           2.9    2.6    2.7
T3b Synonyms                     25    --   12.0   15.0   13.1           4.8    4.0    4.4
Writing Ability
T5a Error Recognition            25    --    8.6   12.2   14.1           4.8    4.6    3.7
T5b Sentence Completion          15    --    5.2    7.5    6.9           3.5    3.8    3.1

Experimental TOEFL
X1  Sentence Comprehension       30    25   21.7   27.1   26.6           7.0    3.2    2.9
X2  Words in Context             30    30   18.5   24.5   22.2           6.6    4.2    4.2
X3  Combining Sentences          30    40   18.3   23.0   22.1           6.2    4.1    3.3
X4  Paragraph Completion         50    50   24.5   32.5   28.9           9.4    6.8    7.4
(N)                                         (91)  (145)  (197)

Note: Mean total scores on TEF4 were 411, 483, 480, and 479 for Peru,
Chile, Japan, and the "World," respectively.

Vocabulary and Writing Ability subsections were not separately timed.

"World" data are from a spaced sample of 1000 of the 14,134 candidates
who took the October 1971 TOEFL.



The rather consistent differences observed between subject groups in
standard deviation values systematically influenced the intercorrelations
among the measures discussed below. Peruvian standard deviations were
largest for all five TOEFL subtests and for the four Experimental TOEFL
subtests. Chilean standard deviations were larger than those of the
Japanese for four of the five regular TOEFL measures, but for only half of
the Experimental TOEFL measures.

The total TOEFL score means were nearly identical for the Chilean,
Japanese, and World samples, with the Peruvian mean about 70 points lower.
These relative magnitudes generally approximated those averaged over
several years for the respective geographic areas as reported in the 1973
Manual for TOEFL Score Recipients. Given the difference in total TOEFL
score, it is not surprising that the Peruvian means were lowest on all
nine objective measures. When comparing Chilean and Japanese means on
TOEFL, it is interesting to note that the Japanese were slightly higher on
Listening Comprehension, English Structure, and Writing Ability, but
scored about one-half standard deviation below Chileans on Vocabulary.
When we note further that on Vocabulary the Peruvian mean was nearly equal
to that of the Japanese, it would seem that a vocabulary test puts the
Japanese at a disadvantage when compared to Spanish-background subjects.
This observation raises the question of whether such differences in test
scores associated with the candidates' first language should be considered
as evidence of possible unfairness of specific subtests in a test of
English as a foreign language. What if, for example, advantages associated
with cognates and similarities in idiom allowed subjects from Indo-European
language backgrounds to score higher on vocabulary than subjects from
non-Indo-European language backgrounds, although they did not score higher
in certain other language areas such as "English Structure"? It would
seem reasonable to suggest that such differences do not necessarily imply
unfairness. Ideally, a test of English as a foreign language will assess
the foreign student's English language skills needed for study at American
campuses, regardless of how difficult or how easy it is for him to
acquire these skills.

Mean score differences observed for the TOEFL Listening and Vocabulary 
measures were also shown for their respective subtests. However, the 
better showing of the Japanese in Writing Ability appeared only in the 
Error Recognition (T5a) component. 

Experimental Section X1, Sentence Comprehension, parallels TOEFL
subsection T1a, Listening Comprehension/Sentences, both in content and
difficulty level. However, the stimuli in T1a were presented in the
spoken mode, whereas parallel stimuli in experimental section X1 were
presented in written form. In going from the spoken to the written mode
of presentation, the Peruvian, Chilean, and Japanese mean scores rose from
42, 55, and 60 percent of the total number of items, to 72, 90, and 89
percent, respectively. These marked shifts suggest that much of the
difficulty in answering TOEFL Listening Comprehension/Sentences items was
indeed due to the listening component of the task.
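
The percentages quoted above follow directly from the mean scores and item counts reported for T1a and X1 in Table 4. The short Python sketch below reproduces that arithmetic; the dictionary and function names are illustrative only and are not part of the study.

```python
# Item counts and group means as reported in Table 4.
ITEMS = {"T1a": 20, "X1": 30}
MEANS = {
    "T1a": {"Peru": 8.3, "Chile": 11.1, "Japan": 12.1},   # spoken stimuli
    "X1": {"Peru": 21.7, "Chile": 27.1, "Japan": 26.6},   # written stimuli
}


def percent_correct(mean_score: float, n_items: int) -> float:
    """Mean score expressed as a percentage of the number of items."""
    return 100.0 * mean_score / n_items


def mode_shift(group: str) -> float:
    """Gain in percent correct when moving from spoken to written stimuli."""
    spoken = percent_correct(MEANS["T1a"][group], ITEMS["T1a"])
    written = percent_correct(MEANS["X1"][group], ITEMS["X1"])
    return written - spoken


for group in ("Peru", "Chile", "Japan"):
    spoken = percent_correct(MEANS["T1a"][group], ITEMS["T1a"])
    written = percent_correct(MEANS["X1"][group], ITEMS["X1"])
    print(f"{group}: spoken {spoken:.1f}%, written {written:.1f}%, "
          f"shift {mode_shift(group):.1f} points")
```

For every group the shift comes to more than 25 percentage points, which is the basis for attributing much of the T1a difficulty to the listening component.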

Reliability indices for the multiple-choice measures are shown in
Table 5. The reliabilities were generally high on all nine measures for
each subject group. In all instances the Peruvian reliabilities were
highest; the Chilean indices usually fell between the Peruvian and
Japanese in magnitude. These between-country differences would be expected
as a result of the restriction of range indicated by the differences in
standard deviation for observed scores among the respective subject
groups.

Table 5

Reliability Indices (a) of Multiple-Choice Measures

                                          Subject group
Multiple-choice measure           Peru   Chile   Japan   World

Regular TOEFL
T1  Listening Comprehension        .94     .93     .85     .88
T2  English Structure              .92     .89     .78     .83
T3  Vocabulary                     .88     .81     .82     .83
T4  Reading Comprehension          .86     .79     .69     .79
T5  Writing Ability                .89     .86     .76     .82
(N)                               (98)   (145)   (199)  (1000)

Experimental TOEFL
X1  Sentence Comprehension         .91     .79     .71
X2  Words in Context               .90     .81     .76
X3  Combining Sentences            .88     .77     .66
X4  Paragraph Completion           .90     .83     .83
(N)                               (91)   (145)   (197)

(a) Computed by Kuder-Richardson formula 20.
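
For readers unfamiliar with Kuder-Richardson formula 20, the following Python sketch shows one conventional way of computing it from a matrix of right/wrong item scores, and illustrates how a narrower range of total scores lowers the index. The response data are invented solely for illustration, the function name is hypothetical, and the use of the population variance (division by n) for total scores is an assumption; the report does not state which variant was used.

```python
from typing import List


def kr20(responses: List[List[int]]) -> float:
    """Kuder-Richardson formula 20 for a matrix of 0/1 item scores.

    responses[i][j] is 1 if examinee i answered item j correctly, else 0.
    """
    n = len(responses)        # number of examinees
    k = len(responses[0])     # number of items
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    # Assumption: population variance of total scores (divide by n).
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    # Sum over items of p * q, where p is the proportion answering correctly.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n
        sum_pq += p * (1.0 - p)
    return (k / (k - 1)) * (1.0 - sum_pq / var_total)


# Invented data: five examinees, four items, wide range of total scores.
full_range = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
# The same items with only the three middle examinees (restricted range).
restricted = full_range[1:4]

print(round(kr20(full_range), 2), round(kr20(restricted), 2))
```

With the same items, dropping the highest- and lowest-scoring examinees reduces the computed reliability, which parallels the restriction-of-range explanation given for the between-country differences.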

Open-Ended Objective Measures 

Summary statistics for the Hunt aluminum passage scores (High K's and
Words per T-Unit) and for the Cloze passage scores (Clozentropy and
Standard Cloze) are given in Table 6. Mean scores for each of the measures
derived from the Hunt aluminum passage task were nearly the same for Chile
as for Japan, but those for Peru were again lower. The mean Words per
T-Unit scores for Peru, Chile, and Japan of 9.7, 11.0, and 10.9 may be
compared to those observed by Hunt (1970b) for American children in grades
4, 8, and 12 of 8.6, 11.5, and 14.4.

For the measures derived from interview and essay judgments and from
the Cloze passages, the standard deviations showed a marked rank-ordering,
with Peruvian standard deviations greater than the Chilean ones, which in
turn were substantially greater than those for the Japanese. The Hunt
measures, however, did not follow this pattern. The standard deviation
for High K's (H1) was greatest for Peru, as expected, but least for Chile.
For the Words per T-Unit (H2) measure, the Japanese standard deviation was
smallest, as expected, but the Chilean standard deviation slightly exceeded
that for Peru.

Table 6

Means, Standard Deviations, and Reliabilities of
Open-Ended Objective Measures and of Subjective Measures

                                     Mean            Standard deviation         Reliability
Measure                       Peru  Chile  Japan    Peru  Chile  Japan    Peru  Chile  Japan

Open-ended objective measures

Hunt Aluminum Passage
H1  High K's                  24.9   27.2   26.3    6.33   4.15   4.59     .96    .84    .92
H2  Words per T-Unit           9.7   11.0   10.9    2.86   2.95   2.21     .98    .97    .86
(N)                           (95)  (143)  (192)

Cloze Passages
C1  Clozentropy score         39.6   60.6   51.1    25.0   19.6   16.7     .91    .88    .78
C2  Standard Cloze score      13.6   19.8   16.4    8.06   7.04   5.87     .82    .85    .79
(N)                           (95)  (143)  (192)

Subjective measures

Interview Judgments
I1  Grammar                   3.09   3.81   3.48    1.13    .91    .65     .94    .89    .73
I2  Vocabulary                3.12   3.97   3.56    1.24    .94    .72     .95    .89    .75
I3  Overall Communication     3.18   4.06   3.61    1.26   1.00    .76     .94    .89    .74
(N)                           (96)  (140)  (187)

Essay Judgments
E1  Content                   2.87   3.73   3.25    1.21    .94    .66     .98    .96    .91
E2  Form                      2.71   3.62   3.24    1.27   1.10    .73     .98    .92    .91
(N)                           (95)  (143)  (192)

In considering the above findings, it should be noted that the High
K's and Words per T-Unit scores are such that a candidate at a given level
of writing ability could obtain a relatively high score on either of the
measures by a strategy that would lower his score on the other. Thus,
if a cautious student limited his rewriting of the Hunt passage to the
introduction of a few coordinating conjunctions, he was likely to receive
a High K's score at or near the maximum of 32, but a very low Words per
T-Unit score of perhaps 5. The same person could instead be more daring
and construct longer T-units, but at the risk of introducing ambiguities
and confusions resulting in a reduction in his High K's score. This added
source of variability may have had a role in the different rank-orderings
in standard deviation noted earlier.

Differences between Clozentropy (C1) and Standard Cloze (C2) scores
are attributable entirely to the method of scoring. Standard deviations
on both measures followed the expected pattern, with those for Peru
substantially larger than those for Japan. Means also followed a now
familiar pattern, greatest for Chile and lowest for Peru.

Differences in reliability, by country and by scoring method, are of
particular interest. For the Clozentropy score, the reliability was .91
for Peru, .88 for Chile, and .78 for Japan, a pattern which was about as
expected, given the differences in standard deviation. However, this
pattern did not hold for the Standard Cloze reliabilities. Assuming that
the Chilean and Japanese reliabilities for C2 were about what would be
expected, the Peruvian reliability was substantially less than one
would expect, particularly since the typical pattern of standard deviations
held for C2 scores. The magnitude of this disparity may best be illustrated
by comparing the efficiency of the two scoring methods with respect to the
Peruvian subjects. Using known relationships between test length and
reliability, the 50-item test would have to be expanded to more than 100
items in order for the Standard Cloze reliability of .82 to be increased
to the .91 value found for the Clozentropy score. The most likely
reason for this difference is that, under the Standard Cloze scoring
procedure, many items were failed by all or nearly all low-scoring subjects
and, as a result, those items provided little or no discrimination among
those subjects. This effect diminishes, of course, for higher-scoring
groups. On the other hand, because the Clozentropy scoring procedure
allows varying degrees of partial credit for many of the responses that
are scored zero in the Standard Cloze procedures, it is not subject to
this pronounced effect on score reliability of the subjects' ability
level.
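
The test-length comparison above can be checked with the Spearman-Brown formula, which is the standard relationship between test length and reliability. The report does not name the formula it used, so treating it as Spearman-Brown is an assumption, and the function names below are illustrative.

```python
import math


def spearman_brown(reliability: float, factor: float) -> float:
    """Predicted reliability when test length is multiplied by `factor`."""
    return factor * reliability / (1.0 + (factor - 1.0) * reliability)


def lengthening_factor(current: float, target: float) -> float:
    """Length multiplier needed to raise reliability from current to target."""
    return target * (1.0 - current) / (current * (1.0 - target))


# Peruvian sample: Standard Cloze reliability .82, Clozentropy reliability .91,
# both based on the same 50 Cloze items.
factor = lengthening_factor(0.82, 0.91)
items_needed = math.ceil(50 * factor)

print(round(factor, 2), items_needed)
```

Under this assumption the Standard Cloze test would need roughly 2.2 times as many items, or about 111 items, which agrees with the statement that the 50-item test would have to be expanded to more than 100 items.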

Subjective Measures 

Means, standard deviations, and reliabilities of the interview and
essay judgments are also given in Table 6. It is evident that for each
subject group the statistics are highly similar over the interview scales
shown: Grammar (I1), Vocabulary (I2), and Overall Communication (I3).
Given the care taken in the judgment procedures to avoid a halo effect, it
is likely that these English-speaking skills were in fact very closely
related for all three groups of subjects.

As was noted for the objective measures, there was a pronounced
difference in score variability, with standard deviations greatest for
Peru and least for Japan. Also as noted before, the differences in
reliability are attributable almost entirely to these differences in
variance. Mean interview scores for Peru, Chile, and Japan were about
3, 4, and 3 1/2, respectively. The differences were sizeable in both a
normative and a criterion-referenced sense, with mean differences between
Chile and Peru of about one standard deviation, representing one full
level on the six-point criterion scale given in Appendix B. Reliabilities
for the three interview scales averaged .94 for Peruvians, .89 for Chileans,
and .74 for Japanese candidates.

It will be recalled that each subject wrote four essays, and that
each of his essays was judged by eight readers, two for each of four
scales: Content (Quantity), Content (Quality), Form (Quantity), and Form
(Quality). By combining ratings over the four essays, there were eight
ratings per candidate on each of the four scales. Because preliminary
analyses of the readings showed virtually no distinction between the two
Content scores, nor between the two Form scores, the scales were combined
to an overall Essay Content (E1) and an overall Essay Form (E2) scale.
Since the eight readers providing the Content (Quantity) ratings of each
candidate's essay were independent of those giving the Content (Quality)
ratings, the correlation between the two pooled sets of ratings provided a
conservative procedure for estimating reliability. The reliability
estimates thus obtained for overall Essay Content ratings were .98 for
Peruvians, .96 for Chileans, and .91 for the Japanese. Following the same
procedure, reliabilities for overall Essay Form ratings were estimated at
.98, .92, and .91 for the same groups of candidates.
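
The pooling-and-correlating procedure described above can be sketched as follows. The ratings are invented purely for illustration (five hypothetical candidates, eight ratings each on the two Content scales), and the function names are not from the study; the sketch assumes that pooling means summing a candidate's eight ratings and that the reliability estimate is the ordinary Pearson correlation between the two pooled sets.

```python
import math
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def pool(ratings: List[List[int]]) -> List[int]:
    """Sum each candidate's eight ratings (two readers x four essays)."""
    return [sum(candidate) for candidate in ratings]


# Hypothetical ratings on the 6-point scale, one row per candidate.
content_quantity = [
    [1, 2, 1, 2, 2, 1, 1, 2],
    [3, 3, 2, 3, 3, 2, 3, 3],
    [4, 5, 4, 4, 5, 4, 4, 5],
    [5, 6, 5, 5, 6, 6, 5, 5],
    [3, 4, 3, 3, 4, 4, 3, 4],
]
content_quality = [
    [2, 2, 1, 1, 2, 1, 2, 2],
    [3, 2, 3, 3, 3, 3, 2, 3],
    [4, 4, 5, 4, 4, 4, 5, 4],
    [6, 5, 5, 6, 5, 5, 6, 6],
    [3, 3, 4, 3, 4, 3, 3, 4],
]

# Correlation between the two independently rated, pooled Content scores.
reliability_estimate = pearson(pool(content_quantity), pool(content_quality))
print(round(reliability_estimate, 2))
```

Because the two pooled scores come from different readers judging nominally different qualities, a high correlation between them supports both the decision to combine the scales and the description of the estimate as conservative.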

The pattern of means, standard deviations, and reliabilities for the
essay judgments resembles that noted above for the interview judgments.
Mean differences were also of the same general magnitude, with the three
groups separated by intervals of about one-half standard deviation.






Intercorrelations Among Test Scores

The relationships among scores, and in particular the intercorrelations
between those which may be compared as predictors and their criteria, are
central to the main purpose of this study; that is, to provide an empirical
data base to facilitate the review and possible revision of the content
specifications of a test of English as a foreign language.

For convenience, data will again be presented in the groupings of
multiple-choice, open-ended objective, and subjective measures. To a
degree this division among groups of measures, based on how they are
scored, may be equated to a functional division between predictive and
criterion usage. This correspondence is in part because factors of
cost and feasibility favor objective measures for ultimate selection as
predictors in any test that is to be administered on a large scale, and in
part because subjective measures may serve better than objective measures
to approximate such pragmatically important criteria as teachers' judgments
about their students' ability to write well in English. However, these
considerations do not necessarily rule out the use of certain objective
measures, such as multiple-choice tests of reading, as criteria.

In examining the various intercorrelations, the objective TOEFL
measures for the receptive skills of listening and reading will be treated
as criteria, as will subjective measures of the productive language skills
of speaking and writing.

The emphasis in the following discussion will be upon examining and
comparing individual intercorrelation values. Given that emphasis, it
would be natural to put a premium on high correlations and, of course,
high correlations between predictive and criterion measures are very
desirable. However, it should be noted that ultimately we are interested
in selecting a combination of measures for a test, and for that purpose,
lower intercorrelations among multiple predictors are desirable. That
is, each predictor should contribute score variance related to its criterion
that is unique to the score variance reflected in the other predictors.
In fact, a given predictive measure which correlates only modestly with
the criteria may, if it taps a unique source of criterion-related variance,
be given a high statistical priority for inclusion in a combined predictive
test.

As may be seen in Tables 5 and 6, reliability estimates varied among
the several measures and among the subject groups. The reliability of
each measure influenced its correlation with other measures, and was
itself influenced by test length and by the range of subjects' scores on
that measure. To allow comparisons among intercorrelations not influenced
by these differences, each set of observed intercorrelations is followed
by a corresponding set corrected for attenuation due to unreliability.
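
The correction for attenuation referred to here is conventionally computed by dividing an observed correlation by the square root of the product of the two reliabilities. The sketch below applies that standard formula to values reported in Tables 5, 7, and 8; the function name is illustrative, and the assumption that the report used exactly this formula is supported only by the fact that it reproduces the tabled values checked here.

```python
import math


def correct_for_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Estimated correlation between true scores on measures x and y.

    r_xy is the observed correlation; r_xx and r_yy are the reliabilities.
    """
    return r_xy / math.sqrt(r_xx * r_yy)


# Listening Comprehension (T1) with English Structure (T2).
# Peru: observed r = .87; KR-20 reliabilities .94 and .92.
peru = correct_for_attenuation(0.87, 0.94, 0.92)
# Japan: observed r = .65; KR-20 reliabilities .85 and .78.
japan = correct_for_attenuation(0.65, 0.85, 0.78)

print(round(peru, 2), round(japan, 2))
```

Rounded to two places, these come to .94 and .80, matching the corrected T1-T2 entries for Peru and Japan in Table 8. Because the correction divides by the reliabilities, groups with lower reliabilities (here the Japanese sample) receive proportionally larger upward adjustments.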

Correlations Among Multiple-Choice Scores and Subscores

Observed intercorrelations among the nine multiple-choice scores are
given in Table 7. Comparing these data by subject group, it is evident
that the Peruvian correlations tended to be greatest and the Japanese
ones to be smallest. "World" intercorrelations tended to fall between the
Spanish and Japanese values. No general pattern over subject groups seems
evident when data for one measure are compared with those for another
measure.

Table 7

Observed Intercorrelations among Multiple-Choice Measures
(Decimal points omitted)

                            Subject       Regular TOEFL         Experimental TOEFL
Measure                     group    T1   T2   T3   T4   T5   X1   X2   X3   X4

T1 Listening Comprehension  Peru     --   87   83   79   84   72   78   74   81
                            Chile    --   83   65   62   73   68   61   65   61
                            Japan    --   65   51   56   52   62   56   51   53
                            World    --   63   54   64   58

T2 English Structure        Peru     87   --   76   75   88   70   77   71   79
                            Chile    83   --   72   67   84   74   71   74   71
                            Japan    65   --   62   62   63   60   65   61   57
                            World    63   --   68   67   77

T3 Vocabulary               Peru     83   76   --   83   79   76   83   80   87
                            Chile    65   72   --   72   72   59   69   69   66
                            Japan    51   62   --   66   66   46   73   59   71
                            World    54   68   --   68   67

T4 Reading Comprehension    Peru     79   75   83   --   79   81   84   83   83
                            Chile    62   67   72   --   73   68   75   76   75
                            Japan    56   62   66   --   61   55   73   70   73
                            World    64   67   68   --   67

T5 Writing Ability          Peru     84   88   79   79   --   73   80   75   81
                            Chile    73   84   72   73   --   64   66   74   74
                            Japan    52   63   66   61   --   51   67   59   60
                            World    58   77   67   67   --

X1 Sentence Comprehension   Peru     72   70   76   81   73   --   83   85   77
                            Chile    68   74   59   68   64   --   71   74   65
                            Japan    62   60   46   55   51   --   56   53   56

X2 Words in Context         Peru     78   77   83   84   80   83   --   86   85
                            Chile    61   71   69   75   66   71   --   71   72
                            Japan    56   65   73   73   67   56   --   64   77

X3 Combining Sentences      Peru     74   71   80   83   75   85   86   --   85
                            Chile    65   74   69   76   74   74   71   --   73
                            Japan    51   61   59   70   59   53   64   --   71

X4 Paragraph Completion     Peru     81   79   87   83   81   77   85   85   --
                            Chile    61   71   66   75   74   65   72   73   --
                            Japan    53   57   71   73   60   56   77   71   --



Intercorrelations among the TOEFL and Experimental TOEFL scores
corresponding to those shown in Table 7, but corrected for attenuation,
are given in Table 8. Following the corrections for attenuation, the
percentage of instances in which the Peruvian correlations exceeded those
of the Chileans by more than .05 dropped from about 90 to about 50,
and the percentage of instances favoring Chile over Japan dropped from
about 75 to about 50. Other patterns of relationship then emerged that
were not apparent from an inspection of uncorrected correlations. As
might be expected, the Listening Comprehension correlations tended to be
lowest, suggesting that the listening measure tapped language skills
not represented in the measures presented in the written mode. This
relative independence between measures in the listening and written modes
was especially pronounced for the Japanese. The Japanese corrected
correlations between Listening Comprehension and the written-format
objective measures ranged from .61 to .81, whereas the range of those
for Peru was from .78 to .94, and those for Chile, from .68 to .91. An
exception to this pattern for each Hispanic group was the relatively high
correlation between Listening Comprehension and English Structure (r = .94
for Peruvians and .91 for Chileans). The other exception, for Peruvians
only, was the Listening Comprehension-Writing Ability correlation of .92.

The differences among corrected correlations associated with the
three subject groups may well indicate real differences in the extent
that the language abilities in question are less related for some foreign
students than for others. A reasonable conjecture is that the
relationships among component English language skills may tend to be lower for
Japanese students than for those having an Indo-European language background.

Table 8

Intercorrelations among Multiple-Choice Measures, Corrected for Attenuation
(Decimal points omitted)

                            Subject       Regular TOEFL         Experimental TOEFL
Measure                     group    T1   T2   T3   T4   T5   X1   X2   X3   X4

T1 Listening Comprehension  Peru     --   94   91   88   92   78   85   81   88
                            Chile    --   91   73   73   82   80   68   77   69
                            Japan    --   80   61   74   65   81   69   68   63
                            World    --   74   63   77   67

T2 English Structure        Peru     94   --   84   84   97   77   85   79   87
                            Chile    91   --   85   80   96   88   83   90   83
                            Japan    80   --   78   85   82   80   84   85   71
                            World    74   --   82   83   93

T3 Vocabulary               Peru     91   84   --   95   89   85   93   91   98
                            Chile    73   85   --   90   86   74   85   88   80
                            Japan    61   78   --   88   84   60   93   80   86
                            World    63   82   --   84   81

T4 Reading Comprehension    Peru     88   84   95   --   90   91   96   95   94
                            Chile    73   80   90   --   89   86   94   98   93
                            Japan    74   85   88   --   84   78   99   99   95
                            World    77   83   84   --   83

T5 Writing Ability          Peru     92   97   89   90   --   81   90   85   90
                            Chile    82   96   86   89   --   78   79   91   88
                            Japan    65   82   84   84   --   68   88   83   76
                            World    67   93   81   83   --

X1 Sentence Comprehension   Peru     78   77   85   91   81   --   92   95   85
                            Chile    80   88   74   86   78   --   89   95   80
                            Japan    81   80   60   78   68   --   77   77   73

X2 Words in Context         Peru     85   85   93   96   90   92   --   97   94
                            Chile    68   83   85   94   79   89   --   90   88
                            Japan    69   84   93   99   88   77   --   91   97

X3 


Peru 


81 


75 


91 


95 


85 


95 


97 




95 


Conblning 


Chile 


77 


90 


88 


98 


91 


95 


90 




92 


Sentences 


Japan 


68 


85 


80 


99 


83 


77 


91 




96 




Peru 


88 


87 


98 


94 


90 


85 


94 


95 




Psrsgrsph 


Chile 


69 


83 


80 


93 


88 


80 


88 


92 




Co^iletlon 


Japan 


63 


71 


86 


95 


76 


73 


97 


96 
However, no compelling suggestion comes to mind that would explain the
fact that the Peruvian correlations still tended to be somewhat greater
than those for the Chilean subjects.

The Reading Comprehension measure may be considered a criterion or
quasi-criterion, with some of the other objective scores treated either as
estimators or as potential alternative measures. Sentence Comprehension
(X1) would be expected to correlate well with Reading Comprehension (T4).
As an efficient measure with face validity for testing reading comprehension,
X1 could be an effective supplement to the less efficient T4 measure.
Corrected correlations between X1 and T4 were high for both Spanish groups
(.91 and .86 for Peruvians and Chileans), and moderate (.78) for Japanese
subjects, but these were not exceptional among the adjusted correlation
values. They would probably have been higher, particularly for the
Japanese, if the Sentence Comprehension test had included a greater number
of difficult items. There was virtually no discrimination in Sentence
Comprehension scores among the more able students, because most of them
missed only two or fewer of the thirty items.

Of particular interest is the very high relationship between Reading
Comprehension (T4) and the three remaining Experimental TOEFL measures,
Words in Context (X2), Combining Sentences (X3), and Paragraph Completion
(X4), with corrected correlations in the .93 to .99 range. The Words in
Context measure is basically a vocabulary test which provides contextual
information that must be used to answer the question fully. Like
Sentence Comprehension (X1), it does not go beyond the sentence level.
Apparently, whatever ability the Reading Comprehension score indicates
regarding beyond-sentence comprehension, that ability was closely related
to the within-sentence comprehension skills for all three groups of
subjects used in this study. The Paragraph Completion (X4) measure is a
multiple-choice equivalent of the Cloze task. The high relationship
between it and Reading Comprehension scores gives empirical support to the
suggestion that the Cloze task draws on the same underlying abilities as
the more conventional tests of reading. The high correlation between
Reading Comprehension and the Combining Sentences (X3) task is harder to
account for. The latter calls for the ability to recognize appropriate
combinations of short statements that have been embedded into a single
sentence. Although decoding a structurally complex sentence is no doubt
vital to effective reading, the Combining Sentences item type is not
intended to tap vocabulary. Finally, Reading Comprehension and Vocabulary
were highly correlated, with correlations of .95, .90, .88, and .84 for
the four subject groups.
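
The ranges quoted in this paragraph can be checked against the Table 8
entries. The following is a minimal sketch, with the T4 row of Table 8
transcribed by hand into a dictionary; the variable and function names
are illustrative only.

```python
# Corrected correlations (x100) between Reading Comprehension (T4) and the
# four Experimental TOEFL measures, transcribed from Table 8.
t4_vs_experimental = {
    "Peru":  {"X1": 91, "X2": 96, "X3": 95, "X4": 94},
    "Chile": {"X1": 86, "X2": 94, "X3": 98, "X4": 93},
    "Japan": {"X1": 78, "X2": 99, "X3": 99, "X4": 95},
}


def value_range(row, measures):
    """Return (lowest, highest) correlation among the named measures."""
    values = [row[m] / 100 for m in measures]
    return min(values), max(values)


per_group = {
    group: value_range(row, ("X2", "X3", "X4"))
    for group, row in t4_vs_experimental.items()
}
overall = (
    min(low for low, _ in per_group.values()),
    max(high for _, high in per_group.values()),
)

for group, (low, high) in per_group.items():
    print(f"{group}: {low:.2f} to {high:.2f}")
print(f"All groups: {overall[0]:.2f} to {overall[1]:.2f}")
```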

In summary, the ability to comprehend reading passages was very
closely approximated, for all three subject groups, by measures of sentence
comprehension, of selecting missing words for passages, of vocabulary
(especially when context plays a role), and of embedding component sentences
into a longer sentence semantically equivalent to its components. It may
be useful to note that in all of the measures described in the above
paragraph, the emphasis is on the comprehension of meaning, with little
emphasis given to knowledge of standard usage, per se.

Among the nine objective measures being compared, the two which most
emphasize standard usage are English Structure and Writing Ability. This
may account for the very high adjusted correlations between these measures
of .97 for Peru, .96 for Chile, and .93 for the World sample. For the
Japanese, the correlation was .82.

Observed and corrected intercorrelations between the subscores for
the measures of Listening Comprehension, Vocabulary, and Writing Ability
are given in Table 9. Corrected correlations ranging from .97 to .99
indicate that the first two Listening Comprehension scores, Sentences and
Dialogues, gave nearly identical information. The Lecture subtest provided
some unique score variance, as indicated by the corrected correlations
between the Lecture subtest and the Sentences and Dialogues subtests,
which ranged from .90 to .94 for the Hispanic students, and from .85 to
.90 for the Japanese.

The two Vocabulary subtests, Sentence Completion and Synonyms, differ
in that the former provides context but the latter does not. Corrected
correlations between these subtests were very high for Peru and Japan at
.99 and .93, respectively. For Chile, the correlation was .77.

The corrected correlations between the Writing Ability subtests, Error
Recognition and Sentence Completion, were .84, .70, and .62 for Peru,
Chile, and Japan. It is interesting to note that, almost without exception,
these correlations between the two subtests within the Writing Ability
section were less than those observed between Writing Ability and the
other separately scored sections of TOEFL.
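
The comparison described above can be made explicit with the corrected
values printed in Tables 8 and 9. The sketch below is illustrative only:
the data are transcribed by hand, the names are hypothetical, and only
the four other regular TOEFL sections (T1 to T4) are compared.

```python
# Corrected correlations (x100), transcribed from Tables 8 and 9.
within_writing = {"Peru": 84, "Chile": 70, "Japan": 62}  # T5a vs. T5b (Table 9)
writing_vs_sections = {                                   # T5 vs. T1..T4 (Table 8)
    "Peru":  {"T1": 92, "T2": 97, "T3": 89, "T4": 90},
    "Chile": {"T1": 82, "T2": 96, "T3": 86, "T4": 89},
    "Japan": {"T1": 65, "T2": 82, "T3": 84, "T4": 84},
}


def sections_exceeding(group):
    """Sections whose correlation with T5 exceeds the T5a-T5b correlation."""
    threshold = within_writing[group]
    return [
        section
        for section, r in writing_vs_sections[group].items()
        if r > threshold
    ]


for group, threshold in within_writing.items():
    higher = sections_exceeding(group)
    print(f"{group}: within-section .{threshold}; exceeded by {', '.join(higher)}")
```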

Correlations Among Open-Ended Objective Measures and Subjective Measures 

As noted earlier, the interview and essay measures were developed as
criteria for the productive language skills of speaking and writing.
Kellogg Hunt's measure of syntactic maturity, Words per T-Unit, has been
found effective for comparing English writing skills among native English
Table 9

Observed and Corrected Intercorrelations
among Regular TOEFL Subsections
(Decimal points omitted.)

                                              Observed             Corrected
                                          intercorrelations    intercorrelations
TOEFL Subsection                          Peru  Chile  Japan   Peru  Chile  Japan

Listening Comprehension
  T1a Sentences vs. T1b Dialogues          87    81     66      99    97     97
  T1a Sentences vs. T1c Lecture            74    71     58      90    90     85
  T1b Dialogues vs. T1c Lecture            77    70     57      94    91     90

Vocabulary
  T3a Sentence Completion vs.
  T3b Synonyms                             77    54     63      99    77     93

Writing Ability
  T5a Error Recognition vs.
  T5b Sentence Completion                  68    56     42      84    70     62
speakers (Hunt, 1970b). In the present discussion, then, its value as
an estimator of the ability to write English as a foreign language will
be examined. Since there has also been considerable interest in the
utility of the Cloze procedure for assessing language skills, the Cloze
task and two methods of scoring the subjects' responses will be examined
in this respect.

The pattern in which the largest observed correlations generally
appeared for Peru and the smallest for Japan was even more pronounced for
the data in Table 10 than it was for the multiple-choice measures.
Another pattern in Table 10 is that the Clozentropy correlations were
generally higher than the corresponding Standard Cloze values. Also of
interest are the correlations among the three interview measures and those
between the Cloze scores. The remarkable similarity of summary statistics
among the interview measures was noted earlier. The suggestion given
then, that the interview scores, Grammar (I1), Vocabulary (I2), and
Communication (I3), are closely related, is borne out by the very high
observed intercorrelations among these interview ratings. Similarly, the
high observed correlations between the Clozentropy and Standard Cloze
scores show that for each subject group the two scoring methods were
clearly equivalent in how they rank-ordered the students. It should be
recalled, however, that the two scoring procedures differed substantially in
efficiency, particularly for low-scoring students.

The corrected correlations among the open-ended objective scores and
subjective scores may be examined in Table 11. Again, the systematic
differences associated with subject group did not simply drop out as a
result of the correction for unreliability. In nearly all instances,





Table 10

Observed Intercorrelations among Open-Ended
Objective Measures and Subjective Measures
(Decimal points omitted.)

                                           Open-ended
                               Subject  objective measures    Subjective measures
Measure                        group     H1   H2   C1   C2    I1   I2   I3   E1   E2

H1  Hunt's High K's            Peru      --   30   51   49    57   56   56   52   52
                               Chile     --   24   51   44    33   34   32   32   39
                               Japan     --  -04   34   30    24   26   27   21   27

H2  Hunt's Words/T             Peru      30   --   41   30    51   52   50   55   51
                               Chile     24   --   49   52    42   46   43   42   43
                               Japan    -04   --   24   23    08   12   07   07   17

C1  Clozentropy                Peru      51   41   --   98    77   76   78   81   88
                               Chile     51   49   --   97    67   69   69   71   80
                               Japan     34   24   --   96    ??   ??   43   52   68

C2  Standard Cloze             Peru      49   30   98   --    71   69   72   76   85
                               Chile     44   52   97   --    66   68   67   69   77
                               Japan     30   23   96   --    44   43   40   46   66

I1  Interview: Grammar         Peru      57   51   77   71    --   97   97   87   87
                               Chile     33   42   67   66    --   95   95   75   83
                               Japan     24   08   ??   44    --   ??   91   49   58

I2  Interview: Vocabulary      Peru      56   52   76   69    97   --   98   87   86
                               Chile     34   46   69   68    95   --   97   76   84
                               Japan     26   12   ??   43    ??   --   96   46   55

I3  Interview: Communication   Peru      56   50   78   72    97   98   --   87   86
                               Chile     32   43   69   67    95   97   --   77   83
                               Japan     27   07   43   40    91   96   --   43   52

E1  Essay: Content             Peru      52   55   81   76    87   87   87   --   94
                               Chile     32   42   71   69    75   76   77   --   89
                               Japan     21   07   52   46    49   46   43   --   76

E2  Essay: Form                Peru      52   51   88   85    87   86   86   94   --
                               Chile     39   43   80   77    83   84   83   89   --
                               Japan     27   17   68   66    58   55   52   76   --
the Japanese correlations in Table 11 remained the smallest; about
three-fourths of the Chilean correlations were smaller than the Peruvian ones.
The discussion of possible reasons for these subject-group differences
in correlations, given in regard to the corrected correlations of the
objective scores in Table 8, applies to the data in Table 11 as well.

The pattern in which Clozentropy correlations consistently exceeded 
those for Standard Cloze scores disappeared once the corrections for 
unreliability were made. As suggested earlier, the real difference in the 
effect of the scoring procedures was that the Clozentropy procedure 
yielded greater reliability and, therefore, greater efficiency, with 
respect to item-writing, testing time, etc. It should be remembered, 
however, that the Clozentropy scoring procedure is much less efficient 
than the Standard Cloze procedure with respect to the time and cost of 
carrying out the scoring itself. 
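
The disappearance of the Clozentropy advantage can be illustrated with
the Peruvian and Chilean entries of Tables 10 and 11. The sketch below
averages the difference between Clozentropy (C1) and Standard Cloze (C2)
correlations with the five subjective measures, before and after
correction. The values are transcribed by hand, the names are
illustrative, and the Japanese group is omitted because several of its
entries are not legible in the available copy.

```python
# Correlations (x100) of C1 and C2 with I1, I2, I3, E1, E2, in that order.
observed = {   # Table 10
    "Peru":  {"C1": (77, 76, 78, 81, 88), "C2": (71, 69, 72, 76, 85)},
    "Chile": {"C1": (67, 69, 69, 71, 80), "C2": (66, 68, 67, 69, 77)},
}
corrected = {  # Table 11
    "Peru":  {"C1": (83, 82, 84, 85, 93), "C2": (81, 77, 82, 84, 94)},
    "Chile": {"C1": (76, 78, 78, 77, 89), "C2": (76, 78, 77, 76, 87)},
}


def mean_advantage(table, group):
    """Mean of (C1 - C2) across the five subjective measures, as a decimal."""
    c1, c2 = table[group]["C1"], table[group]["C2"]
    return sum(a - b for a, b in zip(c1, c2)) / len(c1) / 100


for group in observed:
    before = mean_advantage(observed, group)
    after = mean_advantage(corrected, group)
    print(f"{group}: observed {before:+.3f}, corrected {after:+.3f}")
```

For both groups the average advantage shrinks after correction, which is
consistent with the view that the observed difference reflected the
greater reliability of the Clozentropy scores.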

Even with corrections for attenuation, correlations between Hunt's
Words per T-Unit (H2) and either of the Essay measures, E1 and E2, were
only middling for the Spanish groups (.56 and .52 for Peruvians; .44 and
.45 for Chileans) and very low for the Japanese (.08 and .19). The
tendency for some students at a given level to favor clarity of expression
(and thus attain a near-perfect "High K's" score) at the expense of using
only very short T-units, and for others to do the opposite, probably
explains the poor showing of Words per T-Unit as an indicator of English
writing ability. Whatever the reason, correlations involving either High
K's or Words per T-Unit generally showed a marked drop from Peruvian to
Chilean subjects, and again from Chilean to Japanese.









Table 11

Intercorrelations among Open-Ended Objective Measures
and Subjective Measures, Corrected for Attenuation
(Decimal points omitted.)

                                           Open-ended
                               Subject  objective measures    Subjective measures
Measure                        group     H1   H2   C1   C2    I1   I2   I3   E1   E2

H1  Hunt's High K's            Peru      --   31   55   55    60   59   59   54   54
                               Chile     --   26   59   52    ??   ??   37   36   44
                               Japan     --  -04   40   35    ??   ??   33   23   30

H2  Hunt's Words/T             Peru      31   --   43   33    53   54   52   56   52
                               Chile     26   --   53   57    45   49   46   44   45
                               Japan    -04   --   29   28    10   15   09   08   19

C1  Clozentropy                Peru      55   43   --   99    83   82   84   85   93
                               Chile     59   53   --   99    76   78   78   77   89
                               Japan     40   29   --   99    61   60   57   62   80

C2  Standard Cloze             Peru      55   33   99   --    81   77   82   84   94
                               Chile     52   57   99   --    76   78   77   76   87
                               Japan     35   28   99   --    ??   ??   52   54   78

I1  Interview: Grammar         Peru      60   53   83   81    --   99   99   91   91
                               Chile     ??   45   76   76    --   ??   99   80   90
                               Japan     ??   10   61   ??    --   ??   99   60   71

I2  Interview: Vocabulary      Peru      59   54   82   77    99   --   99   90   89
                               Chile     ??   49   78   78    ??   --   99   82   92
                               Japan     ??   15   60   ??    ??   --   99   56   67

I3  Interview: Communication   Peru      59   52   84   82    99   99   --   91   90
                               Chile     37   46   78   77    99   99   --   83   92
                               Japan     33   09   57   52    99   99   --   52   63

E1  Essay: Content             Peru      54   56   85   84    91   90   91   --   96
                               Chile     36   44   77   76    80   82   83   --   95
                               Japan     23   08   62   54    60   56   52   --   84

E2  Essay: Form                Peru      54   52   93   94    91   89   90   96   --
                               Chile     44   45   89   87    90   92   92   95   --
                               Japan     30   19   80   78    71   67   63   84   --
The two Cloze measures performed very well as indicators of essay
writing ability. Appropriately, the correlations were highest with
respect to Essay Form (E2), with that measure and Clozentropy correlating
.93, .89, and .80 for Peruvians, Chileans, and Japanese, respectively.

Correlations Between Multiple-Choice Scores and Other Scores 

The observed correlations between multiple-choice scores on the one
hand, and open-ended objective scores and subjective scores on the other,
will be considered next. The interview and essay measures, and to a
limited degree the Cloze measures, will be regarded as criteria for
selected multiple-choice measures. The observed correlations in Table 12
followed patterns generally consistent with those observed earlier.
Peruvian correlations tended to be highest and Japanese correlations
lowest, and Clozentropy scores usually correlated slightly more with
the objective measures than did the Standard Cloze scores.

Turning to the corrected correlations in Table 13, note that for any
given objective measure the correlations with the three interview scales
were nearly identical. For convenience, then, Communication (I3) will be
used in subsequent discussions to represent the speaking criteria.

The Communication measure came nearer than any other in the study to
serving as a criterion for Listening Comprehension. The correlations were
gratifyingly high, with the .82 observed for the Japanese being the
largest correlation for that language group between spoken communication
and the several objective measures. The second best predictor of I3 for
the Japanese was English Structure (r = .71). For Peruvian and Chilean
subjects, the corrected correlations between Listening Comprehension
Table 12

Observed Correlations between Multiple-Choice Measures,
and Open-Ended Objective Measures and Subjective Measures
(Decimal points omitted.)

Column key: Hunt: H1 = High K's, H2 = Words/T. Cloze: C1 = Clozentropy,
C2 = Standard Cloze. Interview: I1 = Grammar, I2 = Vocabulary,
I3 = Communication. Essays: E1 = Content, E2 = Form.

                                           Open-ended
Multiple-choice                Subject  objective measures    Subjective measures
measure                        group     H1   H2   C1   C2    I1   I2   I3   E1   E2

T1  Listening Comprehension    Peru      45   43   81   80    79   79   79   80   87
                               Chile     34   34   68   65    69   68   71   72   77
                               Japan     20   12   52   51    66   ??   ??   ??   ??

T2  English Structure          Peru      44   43   83   78    80   81   81   82   ??
                               Chile     36   47   82   79    78   77   77   81   ??
                               Japan     26   09   64   62    53   53   54   46   68

T3  Vocabulary                 Peru      57   42   82   83    75   ??   ??   ??   ??
                               Chile     35   36   70   70    64   ??   ??   ??   70
                               Japan     25   31   64   64    43   49   46   39   57

T4  Reading Comprehension      Peru      56   47   79   79    79   ??   ??   ??   ??
                               Chile     44   47   76   77    60   ??   ??   ??   ??
                               Japan     29   18   69   65    44   45   44   48   58

T5  Writing Ability            Peru      45   46   82   80    79   ??   70   70   ??
                               Chile     38   50   76   76    69   ??   ??   71   ??
                               Japan     28   22   64   63    44   47   45   53   61

X1  Sentence Comprehension     Peru      64   44   82   79    80   79   80   77   78
                               Chile     46   36   77   73    64   67   66   70   74
                               Japan     32   03   55   51    47   47   48   38   50

X2  Words in Context           Peru      55   46   87   84    78   77   79   78   82
                               Chile     36   45   79   78    69   72   71   67   76
                               Japan     27   29   73   69    47   50   47   46   61

X3  Combining Sentences        Peru      65   40   82   81    78   78   79   75   78
                               Chile     41   45   81   80    60   64   62   65   72
                               Japan     32   16   65   64    42   43   39   47   64

X4  Paragraph Completion       Peru      57   42   89   89    73   73   74   76   82
                               Chile     35   52   77   77    58   59   58   59   68
                               Japan     35   23   72   70    43   48   46   45   58
Table 13

Correlations between Multiple-Choice Measures and Open-Ended
Objective Measures and Subjective Measures, Corrected for Attenuation
(Decimal points omitted.)

Column key: Hunt: H1 = High K's, H2 = Words/T. Cloze: C1 = Clozentropy,
C2 = Standard Cloze. Interview: I1 = Grammar, I2 = Vocabulary,
I3 = Communication. Essays: E1 = Content, E2 = Form.

                                           Open-ended
Multiple-choice                Subject  objective measures    Subjective measures
measure                        group     H1   H2   C1   C2    I1   I2   I3   E1   E2

T1  Listening Comprehension    Peru      ??   ??   ??   ??    ??   84   84   83   91
                               Chile     38   36   75   73    76   75   78   76   83
                               Japan     23   14   64   62    84   83   82   59   72

T2  English Structure          Peru      ??   ??   ??   ??    ??   87   87   ??   92
                               Chile     42   50   93   91    88   87   87   88   98
                               Japan     31   11   82   79    70   69   71   55   81

T3  Vocabulary                 Peru      62   45   92   98    82   83   82   80   84
                               Chile     43   41   83   86    77   77   73   74   ??
                               Japan     29   38   80   80    55   62   59   ??   ??

T4  Reading Comprehension      Peru      62   51   89   94    88   87   87   84   85
                               Chile     56   56   91   97    74   76   73   ??   ??
                               Japan     36   23   94   88    62   62   62   ??   ??

T5  Writing Ability            Peru      49   49   91   94    86   85   86   85   93
                               Chile     45   55   87   89    79   78   ??   77   ??
                               Japan     33   27   83   81    ??   62   60   ??   73

X1  Sentence Comprehension     Peru      68   ??   ??   ??    ??   85   87   81   83
                               Chile     56   41   92   89    76   80   79   80   87
                               Japan     40   04   74   68    65   64   66   ??   ??

X2  Words in Context           Peru      59   49   96   98    85   83   86   83   87
                               Chile     43   52   94   96    83   86   84   78   88
                               Japan     32   36   95   89    63   66   63   55   73

X3  Combining Sentences        Peru      71   43   92   95    86   85   87   81   84
                               Chile     51   52   99   99    73   78   75   76   86
                               Japan     41   21   90   88    60   61   56   61   83

X4  Paragraph Completion       Peru      61   45   98   99    79   79   80   81   87
                               Chile     43   60   90   95    70   71   68   69   78
                               Japan     40   27   90   86    55   61   59   52   67

and spoken Communication (I3) were .84 and .78, respectively. For both
Hispanic groups, a number of the objective measures presented in the
written mode correlated more highly with I3 than did the Listening
Comprehension measure. For Peruvians, there was a near five-way tie among
the objective measures for estimating spoken Communication. Measures T2,
T4, X1, and X3 correlated with the I3 criterion at .87, and Words in
Context (X2) did so at .86. For Chileans, the best estimator of spoken
Communication was English Structure, correlating at .87, closely followed
by Words in Context at .84.
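
The rankings described above can be reproduced from the I3 column of
Table 13. The following is a minimal sketch with the corrected
correlations transcribed by hand; the names are illustrative, and the
Chilean Writing Ability (T5) entry is left out because it is not legible
in the available copy.

```python
# Corrected correlations (x100) with Interview-Communication (I3), Table 13.
i3_corrected = {
    "Peru":  {"T1": 84, "T2": 87, "T3": 82, "T4": 87, "T5": 86,
              "X1": 87, "X2": 86, "X3": 87, "X4": 80},
    "Chile": {"T1": 78, "T2": 87, "T3": 73, "T4": 73,
              "X1": 79, "X2": 84, "X3": 75, "X4": 68},
    "Japan": {"T1": 82, "T2": 71, "T3": 59, "T4": 62, "T5": 60,
              "X1": 66, "X2": 63, "X3": 56, "X4": 59},
}


def best_estimators(group):
    """Return the highest I3 correlation and every measure that attains it."""
    row = i3_corrected[group]
    top = max(row.values())
    return top / 100, [measure for measure, r in row.items() if r == top]


for group in i3_corrected:
    top, measures = best_estimators(group)
    print(f"{group}: {top:.2f} ({', '.join(measures)})")
```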

For estimating Essay Form (E2) scores, the English Structure test
again performed particularly well, yielding corrected correlations of .92,
.98, and .81 for Peru, Chile, and Japan, respectively. Writing Ability
and Words in Context also performed very satisfactorily in estimating E2
scores.

There is an increasing body of literature suggesting that the Cloze
procedure is effective for estimating reading comprehension. Considering
the Reading Comprehension (T4) measure as a criterion, that contention is
strongly supported by the present data, with corrected correlations
averaged over the two scoring procedures of about .91 for Peruvians, .95
for Chileans, and .91 for Japanese.

As noted early in this report, the advantages of the Cloze procedures
are to some degree offset by the problem of administering and scoring
open-response tests on a large scale, as compared to using multiple-choice
measures. It is, therefore, of interest to treat Cloze scores as criteria,
to see whether measures that are more readily obtained and scored can be
used in their place. Correlations between the Paragraph Completion
measure, which is a multiple-choice variant of the Cloze task, and Cloze
scores (C1 and C2) were gratifyingly high, at .98 and .99 for Peruvians,
.90 and .95 for Chileans, and .90 and .88 for the Japanese. Interestingly,
the Words in Context (X2) and Combining Sentences (X3) tasks also correlated
very highly with the Cloze scores. Correlations between X2 and the Cloze
scores averaged about .97, .95, and .92 for Peru, Chile, and Japan,
respectively. Those between X3 and the two Cloze scores averaged about
.93, .99, and .89, for subject groups in the same order.
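
The averages quoted for X2 and X3 follow directly from the C1 and C2
columns of Table 13. A short sketch, with the table entries transcribed
by hand and with illustrative names, is given below.

```python
# Corrected correlations (x100) with (C1 Clozentropy, C2 Standard Cloze),
# transcribed from Table 13.
cloze_correlations = {
    "X2": {"Peru": (96, 98), "Chile": (94, 96), "Japan": (95, 89)},
    "X3": {"Peru": (92, 95), "Chile": (99, 99), "Japan": (90, 88)},
}


def mean_cloze(measure, group):
    """Average of the C1 and C2 correlations, expressed as a decimal."""
    c1, c2 = cloze_correlations[measure][group]
    return (c1 + c2) / 200


for measure, groups in cloze_correlations.items():
    summary = ", ".join(f"{g} {mean_cloze(measure, g):.3f}" for g in groups)
    print(f"{measure}: {summary}")
```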

Evaluations of Multiple-Choice and Open-Ended Objective Measures 

A central purpose of the study was to make comparisons among a wide
variety of measures, both objective and subjective, to serve as either
predictors or criteria that would be of use in specifying the content of
an "ideal" TOEFL. To further this purpose, each of the multiple-choice
and open-ended objective measures will now be considered in turn, with
information regarding its relative merits for use in an operating TOEFL.
This information will be taken from the preceding discussion of results.

Multiple-choice measures 

Measures under immediate consideration for their desirability as part
of a future TOEFL are those now in TOEFL and the additional objective
measures found in the Experimental TOEFL developed for the present study.

T1: Listening Comprehension. Criterion measures of listening
comprehension were not developed as part of the study, nor were alternative
experimental measures to the three subtests making up the TOEFL Listening
Comprehension section. However, the study does provide a basis for some
conclusions about the usefulness of this section and its component parts.

Evidence that the listening measures contributed variance not found
when materials are presented in the written mode is available in Table 7,
where generally lower correlations were found between Listening
Comprehension and the eight written-mode objective measures than among the latter.
Furthermore, the finding of marked reductions in item difficulty associated
with presenting equivalent items entirely in the written mode in the
Sentence Comprehension (X1) measure, as opposed to presenting them in the
spoken mode in Listening Comprehension/Sentences (T1a), suggests that much
of the difficulty in the latter is indeed attributable to the listening
task itself, rather than to such factors as reasoning ability, general
vocabulary, and reading ability.

Further evidence that the Listening Comprehension (T1) measure is
working satisfactorily is provided by the correlations between it and the
Interview-Communication (I3) scores shown in Tables 12 and 13. The Listening
Comprehension measure was an effective estimator of spoken communication
ability for the Hispanic groups, and was the best estimator of this
ability among the objective measures for the Japanese.

Information regarding the relationships among the three component
measures of Listening Comprehension was also obtained. As shown in Table
9, the Sentences and Dialogues components were so highly intercorrelated
as to be virtually interchangeable, but the Lecture component contributed
some unique variance. For Peruvian and Chilean subjects, the Sentences
and Dialogues portions estimated spoken Communication (I3) equally well,
with corrected correlations for both subtests of about .88 for Peruvians
and .78 for Chileans. The Lecture portion correlated with I3 to a much
lesser degree, at about .71 for both groups. For the Japanese, spoken
Communication was estimated by T1a, T1b, and T1c at corrected correlation
values of .82, .75, and .77, respectively, suggesting that the Sentences
(T1a) component was the most effective estimator of I3 scores.

T2: English Structure. This measure appears superior in several
respects. One of the shortest objective tests (20 minutes), it ranked
second among the nine in reliability. As shown in Table 13, English
Structure was the best all-around estimator of Essay Form, and for the two
Spanish groups it vied with Words in Context as the best estimator of
Interview-Communication scores. As shown in Table 8, English Structure
"clustered" with Writing Ability, perhaps because of an emphasis on
"standard usage" skills, with the two measures having corrected
correlations of .97, .96, and .82 for Peru, Chile, and Japan respectively.

T3: Vocabulary. This measure is also efficient. A 15-minute test,
its reliability is greater than that of the 40-minute Reading Comprehension
section. Some of its critics have objected to this measure on the grounds
that vocabulary testing may encourage a tendency they believe is already
too prevalent among students learning a foreign language: that of studying
words in isolation, with the implied neglect of other aspects of language
learning. Actually, the Vocabulary measure has two component parts,
Vocabulary-Sentence Completion (T3a) and Vocabulary-Synonyms (T3b). The
former provides context but the latter does not, and is thus particularly
vulnerable to the criticism noted above.

For Peru and Japan, the two Vocabulary components worked about
equally well in estimating Reading Comprehension scores and Essay Form
scores. Among Chileans, however, the Vocabulary-Sentence Completion (T3a)
measure was the better estimator of both criteria.

Using the Interview-Vocabulary measure (I2) as a criterion, the two written
Vocabulary components (T3a and T3b) were equally effective for the Peruvian
subjects, with corrected correlations of .83 and .83. For Chileans and
Japanese, however, Sentence Completion (T3a) was clearly a better estimator
of Interview-Vocabulary than was the Synonyms (T3b) component. The
corrected correlations between I2 and T3a were .79 and .74 for the latter
two subject groups, compared to I2 versus T3b correlations of .66 and .54
for the same groups.

T4; Reading Comprehension . Although criterion measures of reading 
comprehension were not developed as part of the study, the study provided 
very useful information about measuring this ability. Note first that the 
Reading Comprehension section is the least efficient of those appearing in 
TOEFL; although it is one of the two longest sections, it has the lowest 
reliability. Furthermore, Reading Comprehension items are expensive to 
develop and difficult to revise for inclusion in a final test, following 
pretesting. Bearing these problems in mind, it is interesting to note in 
Table 8 the very high corrected correlations between Reading Comprehension 
(T4) and Experimental TOEFL measures X2, X3, and X4 of .94 to .96 for
Peru, .93 to .98 for Chile, and .95 to .99 for Japan. If the Sentence
Comprehension (X1) section had included items at a higher level of
difficulty, the same would probably have been observed for that item type
as well. The implications regarding the use of one or more of these item
types to test reading comprehension will be considered as each is discussed
in turn. As may be seen in Table 13, Reading Comprehension scores are
also closely approximated by the two Cloze scores.

T5: Writing Ability. The Essay Form measure was specifically
developed as a writing ability criterion. As indicated in Table 13, the 
Writing Ability section estimated that criterion very well, but only to 
about the same degree as did the Words in Context measure. For the 
Chileans and Japanese, the Writing Ability section was substantially
outperformed by English Structure as an estimator of actual writing
performance.

The component subtests of Writing Ability, Error Recognition and
Sentence Completion, were less highly interrelated than other sets of
tests, with corrected correlations of .54, .70, and .62 for Peru, Chile,
and Japan. These correlations are lower than most of the values observed 
between measures yielding separately reported scores, such as Writing 
Ability and Listening Comprehension. Of the two Writing Ability subtests,
Error Recognition was superior for estimating the Essay Form criterion for 
all three subject groups. Corrected correlations between Error Recognition 
(T5a) and Essay Form (E2) were .93, .90, and .75, while those between
Sentence Completion (T5b) and Essay Form (E2) were only .84, .70, and .55 
for the three subject groups. 

As noted above, English Structure and Writing Ability are highly 
correlated, with the two measures perhaps constituting a "standard usage" 
cluster. This would suggest that if both measures are retained, they 
might be used jointly to provide a single reported score rather than 
separate scores. 

Experimental Section X1: Sentence Comprehension. This measure is
fully equivalent to the first portion of the TOEFL Listening section,
Sentence Comprehension (T1a), except that it is presented in the written
rather than the spoken mode. It is an efficient, highly reliable measure.
However, the Sentence Comprehension (X1) test was too easy. The large
number of participants receiving perfect or near-perfect scores depressed
the correlations between Sentence Comprehension and other measures to some
extent, especially for the Japanese. If it is to be used in future
examinations, items should be introduced that cover a broader range of
difficulty.

As the only measure of reading comprehension at the very important
sentence level, the Sentence Comprehension item type has much to recommend
it. Even in its present easy form, the measure correlates well with
Reading Comprehension scores. It has the further advantage of face
validity to recommend its use as an alternative or supplement for the
more cumbersome Reading Comprehension measure.

Experimental Section X2: Words in Context. This is a vocabulary
measure in which words and expressions in sentence context provide the
stimuli. By testing vocabulary in context, the X2 item format reduces the
basis for the criticism that vocabulary testing may foster the study of
words in isolation. Actually, each item requires the use of two sources
of information to answer it fully: the meaning of the underlined word or
expression, and the added, interacting contextual information. The X2
measure, as developed, emphasizes "communicativeness," or the recognition
or generating of equivalent messages, rather than "well-formedness," or
knowledge of standard usage.

Corrected correlations between the Words in Context measure and
Reading Comprehension were .96, .94, and .99 for the three subject groups.
The correlations between X2 and Cloze scores C1 and C2 were almost as
high, averaging .97, .95, and .92 for Peruvians, Chileans, and Japanese,
respectively (see Table 13). Thus, Words in Context provides an excellent
alternative to either the Reading Comprehension or the Cloze scores, both
of which are considerably less efficient and more costly to use. It does
not, however, have as much face validity for estimating reading ability as
has the present Reading Comprehension measure. The Words in Context (X2)
measure is substantially more related to the Reading Comprehension and
Cloze measures than is Vocabulary, which suggests that X2 does indeed
require the use of context in a meaningful way.

Experimental Section X3: Combining Sentences. The Combining Sentences
measure was developed as a multiple-choice approximation to the skills
presumably tapped by Kellogg Hunt's Words per T-Unit (H2) measure. None
of the objective measures, including Combining Sentences, correlated well
with Words per T-Unit. However, X3 was the best of the objective estimators
of H1, the number of "K's" effectively expressed.

The Combining Sentences measure correlated well with Reading Compre- 
hension and with the Cloze scores. However, it is difficult to produce 
these Combining Sentences items, and to respond to them, so the measure 
seems less promising than the others included in the Experimental TOEFL. 

Experimental Section X4: Paragraph Completion. The Paragraph
Completion measure was developed as a multiple-choice equivalent of the
Cloze measure. Corrected correlations between X4 and Cloze scores C1
and C2 were high, but not quite as high as those between Words in Context
and the Cloze scores. The Paragraph Completion measure was also highly
correlated with Reading Comprehension, but again not quite at the level
observed for Words in Context.

An advantage of the Paragraph Completion format for estimating either
Cloze scores or Reading Comprehension is the face validity associated
with the fact that it consists of passages of connected prose, rather than
a set of distinct, unrelated sentences.

Open-Ended Objective Measures 

The measures derived from the Hunt and Cloze procedures are in some
respects quasi-criteria, for which attempts were made to develop
multiple-choice equivalents, yet they are not fully established as criteria
in their own right. Neither set of measures is readily amenable to testing
on a large scale, although some of the scores for each can be generated 
objectively. 

H1 and H2: Hunt Aluminum Passage Scores. The Hunt task of rewriting
a passage presented in the form of 32 very short "kernel" sentences has
been found effective in rank-ordering Americans on what is, presumably,
"syntactic maturity" (Hunt, 1970b). Particularly useful for that purpose
is the "Words per T-Unit" measure, a refinement of the traditional
sentence-length index of language complexity. In comparison with other
measures used in the present study, the performance of the Hunt measures
in estimating essay scores was disappointing (see Table 11). The corrected
correlations between Words per T-Unit (H2) and Essay Form (E2) were .52,
.45, and .19 for Peru, Chile, and Japan, respectively, as compared, for
example, with Clozentropy (C1) versus Essay Form (E2) correlations of .93,
.89, and .80, and with English Structure (T2) versus Essay Form (E2)
correlations of .92, .98, and .81 for the same subject groups.

It may be that the correlations involving Words per T-Unit are
comparatively low for foreign students because "syntactic maturity" is
confounded with other factors. For example, among foreign students at a
given level of "syntactic maturity" in English, individual differences in 
risk-taking behavior may result in a wide range of structural complexity 
in their responses to the Hunt rewriting task. The considerable range and 
variability in the High K's scores lend support to this interpretation. 

The Hunt task remains intriguing as an open-ended measure of English 
structure little affected by vocabulary or creativity in essay writing. 
An alternative scoring procedure, combining level of clarity (H1) with
level of complexity (H2), was carried out, with results very similar to
those observed for H2 alone. 
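
As a minimal sketch of how the Words per T-Unit index (H2) is computed, the score sheet in Appendix A tallies T-units and words sentence by sentence and then totals each column; the index is the total word count divided by the total T-unit count. The class and function names and the sample tallies below are hypothetical and serve only to illustrate the ratio.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SentenceTally:
    """One row of a scorer's tally for a rewritten passage (hypothetical layout)."""
    t_units: int  # T-units marked within this sentence
    words: int    # words counted within this sentence


def words_per_t_unit(rows: List[SentenceTally]) -> float:
    """Return total words divided by total T-units across all sentences."""
    total_t_units = sum(row.t_units for row in rows)
    total_words = sum(row.words for row in rows)
    if total_t_units == 0:
        raise ValueError("at least one T-unit is required")
    return total_words / total_t_units


# Hypothetical tallies for a three-sentence protocol.
rows = [SentenceTally(1, 12), SentenceTally(2, 20), SentenceTally(1, 16)]
h2 = words_per_t_unit(rows)
print(f"Words per T-Unit: {h2:.1f}")
```

A writer who takes more risks and embeds more kernels in each T-unit raises this ratio whether or not the result is clear, which is consistent with the interpretation offered above for the low correlations among foreign students.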

C1 and C2: Cloze Passage Scores. In conjunction with either the
Standard Cloze or the Clozentropy scoring procedure, the Cloze task 
performed very satisfactorily, particularly for estimating the Essay Form 
subjective criterion measure and the Reading Comprehension multiple-choice 
criterion measure (see Tables 11 and 13). Performance on the Cloze 
task was, in turn, very well estimated by several of the multiple-choice 
measures, including Words in Context, Paragraph Completion, and to a 
slightly lesser degree, English Structure. It will be recalled that the 
Paragraph Completion (X4) measure was designed with the express purpose of 
approximating the Cloze scores. 

The Clozentropy scoring procedure was substantially more efficient 
than the Standard Cloze procedure for low-scoring subjects. Except for 
that difference, the scores were functionally very similar. 






It is interesting to note that, as one would expect, the Clozentropy
correlations for the written subjective criterion, Essay Form (E2), were
substantially higher than those for the spoken criterion, Communication
(I3). The rank-ordering of corrected correlations between Clozentropy
(C1) scores and the five TOEFL subscores is also logical, for the Chilean
and Japanese data (see Table 13). Among Chileans, the TOEFL subscores
most highly correlated with Clozentropy scores were English Structure and
Reading Comprehension (r = .91 to .93); next were Writing Ability and
Vocabulary (r = .83 to .87); the lowest correlation was for Listening
Comprehension (r = .75). Among the Japanese, the highest TOEFL subtest
correlation with Clozentropy scores was observed for Reading Comprehension
(r = .94); intermediate correlations were observed for English Structure,
Vocabulary, and Writing Ability (r = .80 to .83); once again, the lowest
value was for Listening Comprehension (r = .64). Corrected correlations
between Clozentropy and TOEFL subscores for Peruvians were so similar,
ranging only from .88 to .92, that comparisons of rank order are not
meaningful. Rank-orderings of uncorrected correlations between Clozentropy
scores and TOEFL subscores (see Table 12) followed essentially the same
patterns. Among Chileans, English Structure again ranked highest and
Listening Comprehension lowest; among Japanese, Reading Comprehension
and Listening Comprehension again ranked highest and lowest, respectively;
for Peruvians, the range of correlations was again too small (.79
to .83) to meaningfully compare rank-orderings.

The above data conflict with those of Darnell (1970), who found that
Clozentropy scores correlated highest with the TOEFL Listening
Comprehension scores, and second highest with Vocabulary. It should be
noted that the difference he observed for correlations between Clozentropy
and the two TOEFL scores just noted was minute (.736 versus .733),
particularly given a sample of only 48 students. The fact remains, however,
that among the TOEFL subtests, Listening Comprehension ranked highest in
correlation with Clozentropy scores in the Darnell study and lowest in the
present study. This difference is of interest, and an explanation should be
sought.

DISCUSSION

Limitations of the conclusions will be considered first. The 
implications of the study will then be discussed with regard to possible 
applications for the TOEFL program. 

Limitations 

In considering the results of this study, it is useful to keep in 
mind the following limitations.

First, the study was restricted to evaluating and comparing TOEFL and 
other measures for the assessment of the foreign student's present skills 
in English as a second language that are considered important for his
success in an American college or university. Thus it did not include the
systematic use of other inputs such as previous grade-point average or
verbal and mathematical aptitude measured with tests in the student's 
native language. 

Second, because the study was primarily correlational, the results 
cannot be interpreted causally. They may show, for example, certain 
differences in patterns of test scores associated with differences in the 
subjects' native languages, but they cannot be used as evidence that these 
score differences are necessarily due to language background differences. 

Third, the study was focused on comparisons among item formats, and 
was not directly concerned with differences in item content within a given 
item format, which can be substantial. Items in the English Structure 
format, for example, may differ considerably in the emphasis given to 
standard usage. A new English Structure subtest differing in emphasis on
standard usage could yield quite different results, particularly if the
candidate groups differed in the emphasis given to formal English usage in 
their learning of English as a second language. 

A final caveat has to do with the great variety of linguistic, 
cultural, and other background variables associated with the TOEFL 
candidate population. Differences appearing among the three groups 
participating in this study only underscore the importance of this
observation. The inclusion of subject groups of very different linguistic and
cultural backgrounds made the results more generalizable than they 
would otherwise have been, but there remain many subject groups, such as 
Pakistanis, Israelis, and Turks, for whom certain findings may be very 
different from those obtained with Peruvian, Chilean, and Japanese subjects. 

Implications for TOEFL Content Specifications 
In this study nine TOEFL item formats and four alternative formats 
using multiple-choice questions have been evaluated for possible use in a 
revised TOEFL. The implications of the study for each format will first 
be examined. This will be followed by some suggestions regarding the 
content specifications that might be used for a revised TOEFL. 

Section Tl, Listening Comprehension, was found to be relatively 
independent of the other objective measures, and to correlate well with 
spoken Communication (I3). These findings add empirical support to the
logical reasons for retaining Listening Comprehension as a separate
measure. Although the Sentences (T1a) and Dialogues (T1b) components
are highly interdependent, face validity considerations would suggest that
both of these, as well as the Lecture (T1c) component, be retained in the
Listening Comprehension section.

Section T2, English Structure, correlated highly with the language
production criteria, i.e., spoken Communication (I3) and Essay Form (E2),
as well as with Writing Ability (T5). The implication of the first two
correlations is that the English Structure format clearly should be
retained. The third correlation suggests that English Structure and
Writing Ability scores could well be combined.

Section T3, Vocabulary, was observed to be highly efficient, requiring
only 15 minutes testing time to yield a satisfactory reliability. Within
the Vocabulary section, the Sentence Completion format was superior to
the Synonyms format for estimating Reading Comprehension (T4), Essay Form
(E2), and spoken Vocabulary (I2). The Synonyms format, therefore, should
probably be dropped, particularly since it is already suspected of
increasing the tendency to study words in isolation. The very high
correlations between Vocabulary (T3) and Reading Comprehension (T4) suggest
that these scores, too, should be combined.

Section T5, Writing Ability, correlated with Essay Form (E2) with
values of .93, .88, and .73 (corrected for attenuation) for Peru, Chile,
and Japan. As noted earlier, the Error Recognition (T5a) component was
considerably more valid than the Sentence Completion (T5b) part for all
three subject groups, having correlations with the Essay Form (E2)
criterion of .93, .90, and .75, respectively. The suggestion that the
Writing Ability section (or its Error Recognition subsection) be replaced
by an actual writing sample received little statistical support from the
Peruvian and Chilean data, but received rather more support from the
Japanese data. The suggestion, of course, has pedagogical as well as
statistical bases, and the data presented here apply only to the latter.
The relationship between Writing Ability and English Structure scores,
and its implication, were noted on page 82.

Section X1, (written) Sentence Comprehension, proved very easy, and
this reduced its correlations with other measures. As an efficient
measure of reading comprehension at the sentence level, it is worthy of
further study. A Sentence Comprehension section covering an adequate
range of item difficulties could be developed and tried out for possible
inclusion in a revised TOEFL. Increasing the difficulty of Sentence
Comprehension items could be readily accomplished by using more difficult
statements or questions, and by introducing answer choices that call for
finer discriminations.

Section X2, Words in Context, showed extremely high correlations with
Reading Comprehension and Cloze scores. An implication of the former is
that the Words in Context format could be used to supplement the less
efficient Reading Comprehension measure.

Section X3, Combining Sentences, was relatively difficult to develop,
and it was outperformed by other measures as an estimator of various
criterion scores. It should be dropped from consideration for inclusion
in a revised TOEFL.

Section X4, Paragraph Completion, was also outperformed by other
multiple-choice measures, even for estimating Cloze scores. It, too,
should probably be dropped from present consideration for TOEFL.

From the above discussion, it would appear that four of the measures
studied would not be likely prospects for inclusion in a revised TOEFL.
These are the Vocabulary/Synonyms (T3b) and Writing Ability/Sentence
Completion (T5b) subtests from TOEFL, and the Combining Sentences (X3) and
Paragraph Completion (X4) measures from Experimental TOEFL. The Reading
Comprehension measure might be reduced from 40 to perhaps 20 or 30 minutes,
and supplemented by another measure having face validity for reading
comprehension, such as Sentence Comprehension (X1). The status of
Vocabulary/Sentence Completion (T3a) is uncertain.

Implications for the Number of Scores to Report 
The question, "How many scores should be reported for TOEFL?" was
raised at the beginning of this paper, and it was noted that the answer
depended not only on the logical distinctions among component skills, but
on how independent these skills are, in fact, for foreign students. Later
in the paper it was observed that the Listening Comprehension subtests are
relatively independent of the other multiple-choice measures, and that the
English Structure and Writing Ability measures form one cluster and the
Reading Comprehension and Vocabulary measures form another. These and
other considerations suggest a revised TOEFL having several components,
but yielding only three scores: I. Listening Comprehension, II. English
Structure and Writing Ability, and III. Reading Comprehension and
Vocabulary in Context.

All of the above comments regarding implications for TOEFL content 
specifications derive from the data gathered in the study, with some 
additional consideration given to face validity. As such, they are
subject to the limitations of the study noted earlier. Furthermore, any
consideration of these or other suggestions regarding possible changes in
TOEFL specifications must also take into account such questions as cost,
timing, and the acceptance of the changes by TOEFL score users.

APPENDIX A: Materials for Scoring Hunt's Aluminum Passage



I. Score Sheet

SUBJECT:                              SCORER:

Sent.      T-Units      Words
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
TOTALS

K                                                        T-Unit    Adequacy of K: High
 1. Aluminum is a metal
 2. It is abundant
 3. It has many uses
 4. It comes from bauxite
 5. Bauxite is an ore
 6. B. looks like clay
 7. B. contains aluminum
 8. It contains several other substances
 9. Workmen extract these other substances from the B.
10. They grind the B.
11. They put it in tanks
12. Pressure is in the tanks
13. The other substances form a mass
14. They remove the mass
15. They use filters
16. A liquid remains
17. They put it through several other processes
18. It finally yields a chemical
19. The chemical is powdery
20. It is white
21. The chemical is alumina
22. It is a mixture
23. It contains aluminum
24. It contains oxygen
25. Workmen separate the aluminum from the oxygen
26. They use electricity
27. They finally produce a metal
28. The metal is light
29. It has a luster
30. The luster is bright
31. The luster is silvery
32. This metal comes in many forms.
TOTALS

II. Abbreviated Instructions for Scoring Aluminum Passage Protocol

A. Mark the end of each T-unit with a red slash.

1. Hunt (p. 4) defines the T-unit as "one main clause plus any sub-
ordinate clause or nonclausal structure that is attached to or
embedded in it."

2. For the purpose of scoring, each sentence, even if it is a fragment,
is assumed to contain at least one T-unit. Thus, you will always
indicate a T-unit boundary at the end of each sentence.

3. Within-sentence T-unit boundaries will occur whenever two adjoining
clauses are independent. If, on the other hand, one of the
adjoining clauses is subordinate to the other, they are both part
of the same T-unit.

4. Writers may use an inappropriate word to join two clauses. In
that case, make the best judgment, based on the sense of the
passage, of whether the clauses are independent. Example (#7):
"it has many uses why it is abundant..." Here, the "why" seems to
imply subordination, as though "because" had been used. Thus, the
two phrases are considered part of the same T-unit.

5. Extraneous T-units (those not involving at least one K) are
treated the same as other T-units. Note them at the bottom of the
score sheet, however.

6. T-unit fragments will not ordinarily occur, because a given
T-unit includes any attached or embedded nonclausal structure
(A-1). They can occur in at least two instances, however.

a. When a segment punctuated as a sentence is actually less than
a clause.

b. When there is an incomplete sentence, which clearly includes a
complete T and a T-fragment.

Indicate T-unit fragments at the bottom of the score sheet.

B. Evaluate the "Adequacy of K"

1. After each K, check one of the 4 categories: Hi, Med, Low, or
Abs. The meaning of each category is roughly as follows (consult
B-2 for details):

a. Hi. Check this category if the K is clearly stated. It does
not need to be stated in standard English.

b. Med. Check this category if the information of K seems to
have been correctly presented, but with some ambiguity. A
useful rule of thumb is to mark the K down to "Med" if careful
reading was required before the meaning was clear.

c. Low. If a K is ambiguously stated, or is misstated, or is
misleading, mark it "Low." Especially in cases of poor
writing, it may be difficult to decide between "Low" and
"Abs." Here, the basic rule is to judge whether the writer
attempted to express the K. If the judgment is "yes," then
check that K as "Low," not as "Abs."

d. Abs. If the K was not referred to in any T or combination of
T's, check "Abs."

2. Several problems come up regularly in evaluating "Adequacy of K."
Each will be considered below.

a. Omissions. Omissions may be either deliberate or inadvertent.
For evaluating whether a writer has adequately expressed a K
by way of inference, it is useful to distinguish between
functional redundancy and non-functional redundancy. The
former is needed for clarity of the message, but the latter is
not.

b. Substitutions or paraphrases. In evaluating substitutions and
paraphrases, the main criterion again is whether the message
is clear and essentially unchanged. To facilitate making
these decisions, a worksheet is attached, with recurring
substitutions evaluated. Add to the list from time to time,
as decisions are made on other recurring substitutions.

c. Added information. Added information will vary on two scales,
relatedness and correctness. If basically unrelated to the K's, it
is disregarded when evaluating K's. Similarly, if related and
correct, it does not influence the evaluation. Added
information that is related to a K, and tends to obscure, distort,
or contradict the original message, should result in a lowered
"adequacy" evaluation.

d. Duplicated K's. If a K is expressed in two places, and the two
versions are not contradictory, record the judged adequacy of the
better version. If the two versions of a K are contradictory,
grade down according to the amount of confusion or distortion
introduced. "Low" will usually be assigned. (For example, Protocol
24 indicates that the chemical is liquid in T-8, and that it is a
powder in T-9.)

e. Misspellings. Misspellings are not penalized, unless they obscure
meaning.

Note: Some will write "alumin," rather than "aluminum"
or "alumina." Mark down for this.

f. Unclear or inappropriate antecedents for reference words. This
problem occurs frequently. Depending on the resulting
misinformation or lack of clarity, check "Med" or "Low."
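
The tallying step that follows these judgments can be sketched in a few lines. The sketch assumes that each of the 32 K's receives exactly one of the four categories and that the "High K's" score discussed in the body of the report (H1) is simply the number of K's checked in the highest category; that reading is an inference from the text, and the function name and sample ratings are hypothetical.

```python
from collections import Counter
from typing import Dict

CATEGORIES = ("Hi", "Med", "Low", "Abs")
NUM_KERNELS = 32  # the Aluminum passage consists of 32 kernel sentences


def tally_adequacy(ratings: Dict[int, str]) -> Counter:
    """Count how many K's fall in each adequacy category."""
    if set(ratings) != set(range(1, NUM_KERNELS + 1)):
        raise ValueError("every K from 1 to 32 needs exactly one rating")
    unknown = {k: v for k, v in ratings.items() if v not in CATEGORIES}
    if unknown:
        raise ValueError(f"unknown categories: {unknown}")
    counts = Counter({category: 0 for category in CATEGORIES})
    counts.update(ratings.values())
    return counts


# Hypothetical protocol: K 12 needed careful reading, K 17 was never expressed.
ratings = {k: "Hi" for k in range(1, NUM_KERNELS + 1)}
ratings[12] = "Med"
ratings[17] = "Abs"

counts = tally_adequacy(ratings)
high_k_score = counts["Hi"]  # assumed here to correspond to H1
print(dict(counts), high_k_score)
```

Keeping all four counts, rather than only the top category, makes it easy to check that the column totals on a score sheet sum to 32 before the scores are recorded.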



APPENDIX B: Materials for Scoring Interviews

I. Interview Rating Sheet

Listener:                  List. Date:
Ss Name:                   Ss TOEFL No.:
Location (circle one):     Tokyo     Lima     Santiago

              Narration/        Technical/
              General           Academic/
              Convers'n         Vocational        Overall
Accent
Grammar
Vocabulary
Fluency
Communic.

Accent
Grammar
Vocabulary (Nar., Gen c.)
Vocabulary (Tech, A, V)
Fluency (N, Gc)
Fluency (T, A, V)
Communic. (N, Gc)
Communic. (T, A, V)
General Comments:

II. Proficiency Descriptions

A. Accent

1. Pronunciation is generally unintelligible.

2. Frequent gross errors and a very heavy accent make understanding
difficult. Meaning of some portions of interview lost or greatly
obscured.

3. Marked "foreign accent" requires concentrated listening, and leads
to some lack of clarity in the message and apparent errors in
grammar or vocabulary.

4. A noticeable accent and occasional mispronunciations are present
but do not interfere with understanding.

5. No conspicuous mispronunciations, but would not be taken for a
native speaker.

6. Native pronunciation, with no trace of foreign accent.

B. Grammar

1. Shows virtually no command of grammar except for stock phrases.

2. High error rate or use of only very few grammatical patterns,
frequently limiting or impairing communication.

3. Error rate or limited use of grammatical patterns causes occasional
misunderstanding. Control of basic patterns is shown.

4. Errors generally limited to those having little effect on under-
standing. Subject uses longer, more complex sentences, and makes
effective use of expressions such as "would," "should," "as
soon as," etc.

5. No systematic errors that influence understanding. Subject
demonstrates command of nearly the same range of grammatical
usages that would be expected of an American college student in
the same interview.

6. Subject's grammar is indistinguishable from that of a native
English speaker at the same level of education.

C. Vocabulary

General

1. Vocabulary inadequate for even the simplest conversation
or narration.

2. Vocabulary is very limited even in such basic topics as family,
home, travel, and time. Narrative (for Latin American subjects) is
severely limited because of vocabulary problems.

3. Vocabulary is adequate for only a limited range of general topics.
General conversation or narrative is weakened by inaccuracies and
limitations in vocabulary.

4. Vocabulary permits discussion of a wide range of general topics,
with circumlocutions. Narrative is not noticeably curtailed by
vocabulary.

5. Considerable range and depth of vocabulary is demonstrated in
general conversation or narration. The subject would not be
mistaken for a native English speaker, but vocabulary limitations
are minor.

6. Vocabulary is apparently as accurate and extensive as that of a
native speaker.

Academic/Professional

1. Academic or professional vocabulary limited to a handful of words.
Insufficient for conversation.

2. Discussion in professional or academic areas is severely limited.

3. Choice of words sometimes inaccurate; limitations of vocabulary
prevent discussion of some common professional or academic topics.

4. Vocabulary is adequate for a moderate range and depth of discussion
in academic or professional areas.

5. Vocabulary permits extensive discussion of subject's professional
area. Though the subject is not a "native speaker," his/her
vocabulary presents little if any barrier to communication.

6. Vocabulary is apparently as accurate and extensive as that of a
native speaker.

D* Fluency (General and Technical Conversation) 

1* Speech is so halting and fragmentary that communication is virtually 
impossible* 

!• Speech is very slow and uneven except for short or routine sentences* 

3* There are smooth portions but speech may become hesitant, Jerky, 
or sprinkled with "ah***," "er***," etc* 

4* Speech is generally smooth with some pauses for rephrasing, 
groping for words, etc* 

5* Speech is effortless and smooth but perceptibly non*native* 

6* Speech on all topics is as effortless and smooth as that of a 
native speaker* 

E. Communication (General Conversation)

1. Subject's ability to communicate even very basic information is
virtually nil. This is evident both in his responses, per se, and
in his apparent inability to understand the interviewer's statements.

2. Interviewer must speak slowly and simply to subject. Subject's
communicating of his/her own ideas is limited to short, simple
sentences, often too fragmentary to be understood.

3. Subject understands a good deal of what is said to him with some
simplification and rephrasing on part of speaker. In his/her own
conversation, subject has some difficulty in getting the message
across, relying on awkward circumlocutions, rephrasing, and
occasional help from the interviewer.

4. Subject understands quite well normal speech both in general and
technical areas. With occasional repetition and rephrasing, the
subject can continue a conversation or a narrative; however,
lack of control of some speech patterns and vocabulary limitations
prohibit very sophisticated discussion.

5. Subject communicates both simple and complex ideas effectively.
He/she may occasionally resort to circumlocutions or express self
in a somewhat "foreign" way (e.g., by using nonstandard word-order),
but these deficiencies impose essentially no limitations on what
he/she discusses, or on the clarity of what he/she says.

6. The subject's ability to communicate is indistinguishable from
what would be expected from an American counterpart.



3. Training Tapes 

Training Tape (Cam) 

Accent - Subject spoke with decided Spanish accent, often making
such mispronunciations as "dere" for there and "dey" for
they. This factor did not interfere with understanding.

Grammar - She used a variety of tenses, both simple and compound
("you promised me that you would pay back...") with a
high degree of success. Some awkwardness with such
phrases as: "the fox, he gets mad" or "in few seconds."

Vocabulary - The narrative part of the interview demonstrated a large
vocabulary (such terms as "rural," "tropical," and an
idiomatic, "Take it easy"). Subject was able to
elaborate rather than just answer a question.

Fluency - Some gropings and hesitations interrupted the
smoothness of the speech.

Communication - Speaker had no problem understanding interviewer. Her
own narrative attempt was well developed and cohesive
so that she got a rather complex message across.



                 General         Academic
                 Conversation    Conversation    Overall

Accent           X               X               4
Grammar          X               X               5-
Vocabulary       5               4+              5
Fluency          4+              4               4+
Communication    5               5               5






Training Tape (Shibuya) 



Accent - Marked Japanese accent which demands a concentrated
listening. Mispronunciations common ("har douter"
for her daughter, "bisit" for visit, "perfrectry"
for perfectly) and tends to drop final consonants of
some words.

Grammar - Uses present and simple past tenses while making some
errors in subject-verb agreement and present perfect
tense. Tends to answer questions in fragments
whenever possible.

Vocabulary - Subject has a certain facility with his limited
vocabulary, but avoids discussing any complex general
or academic matters.

Fluency - Frequently hesitant, uses fragments.

Communication - Subject appears reluctant or unable to complicate his
simple conversation with the interviewer. Interviewer
must repeat or rephrase some simple questions (e.g.,
"Tell me about the place you live"). Initially,
subject confused "export" with "transport" but then
corrected himself.



                 General         Academic
                 Conversation    Conversation    Overall

Accent           X               X               3
Grammar          X               X               3
Vocabulary       3               3               3
Fluency          3               3               3
Communication    3               3               2






Training Tape (Vaca) 



Accent - Accent extremely thick. Listener must strain to
understand and still fails to catch some phrases. A
typical pronunciation example: "did" for died.

Grammar - Subject omits articles, often uses present tense when
past is desired (although can use some simple past
tenses) and lapses into his native tongue quite
readily. He speaks in fragments.

Vocabulary - Range of words extremely limited. Could use words
like "brother," "chair," and "work," but faltered on
more technical words.

Fluency - Subject's speech was hesitant, stammering with much
fumbling for vocabulary and much repetition; in a
word, non-fluent.

Communication - In spite of grammatical and vocabulary problems,
subject was able to understand and get across some
information in English during the interview.



                 General         Academic
                 Conversation    Conversation    Overall

Accent                                           1+
Grammar          X               X               2-
Vocabulary       2-              2-              2-
Fluency          1               1               1
Communication    2-              1+              2-






Training Tape (Tokyo 42) 

Accent - Very slight. Although it would be impossible to say
which country the subject is from, "an alert listener"
would detect that English is not her native language.

Grammar - Very few mistakes. Only prominent error occurred in
the subject's description of her favorite teacher
("The thing I liked is she gives us homework...").
Several minor mistakes. Interviewee used many
and varied complex constructions.

Vocabulary - Extensive, with good knowledge of idioms and common
usage ("pretty large yard," "on top of a hill,"
"something like that," "manage to accommodate").

Fluency - Pauses only as much as a native speaker might.

Communication - Understands virtually everything said to her, and her
own statements are complete and clear.



                 General         Academic
                 Conversation    Conversation    Overall

Accent           X               X               5+
Grammar          X               X               6-
Vocabulary       6               6-              6
Fluency          6               6               6
Communication    6               6               6






